Team SIX Members:

  • Ananta Arora (SID: 100421624)
  • Jinghao Chen (SID: 100406201)
  • Roxanne Alvarez (SID: 100405742)
  • Teshani Jayasinghe (SID: 100422405)

Summary

This notebook works through the following exploratory data analysis steps.

  1. Assess the general characteristics of the dataset
  • How many records do we have? How many variables?
  • What are the variable names? Are they meaningful?
  • What type is each variable?
  • How many unique values does each variable have?
  • What value occurs most frequently, and how often does it occur?
  • Are there missing observations (vertically and horizontally)? If so, how frequently does this occur?
      • Number of missing values by column
      • Number of missing values by row
      • Decision on dropping missing values

  2. Examine descriptive statistics for each variable
  • For categorical variables, answer the main questions like:
      • [How many distinct values or “levels” does the variable exhibit?](#dist_cat)
      • [How often does each of these levels occur in the dataset?](#cat_level)
      • [How does the behavior of another variable X vary over the levels of C?](#behavior)
  • For numerical variables, answer the main questions like:
      • [What is the mean, median, standard deviation?](#numsummary)
      • [Does the data follow the normal distribution?](#normality) ([Shapiro-Wilk Test](#shapiro))

  3. Where possible, and certainly for any variable of particular interest, examine exploratory visualizations and identify anomalies
  • Box plot
  • Number of outliers
  • Bar charts

  4. Look at the relations between key variables using the ideas of visualization and statistical tests
  • Correlation Matrix
  • Point Plots
  • Continuous Variables Normalization
      • Log Transformation Method
      • L2 Normalization
      • BoxCox Method
      • Min-Max Method
  • Statistical Tests
      • [Continuous Variables](#contvar)
      • [Ordinal Variables](#ordinal)
      • [Binary Variables](#binary)
      • [Summary Table](#summary)

Packages

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
from scipy.stats import ttest_ind
from scipy.stats import boxcox
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
In [31]:
# prompt: mount google drive

# from google.colab import drive
# drive.mount('/content/drive')

Settings

In [32]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

Data Set

Load dataset

In [33]:
df = pd.read_csv("filtered_data.csv")
df_schema = pd.read_csv('schema.csv')

Data Exploration

There are 18,883 records and 70 variables in the dataset.

In [34]:
# number of records and variables
df.shape
Out[34]:
(18883, 70)

Schema

The name, category, and data type of every column are listed below:

In [35]:
# variable names
df.columns
Out[35]:
Index(['Hospital Mortality', 'Age', 'Gender', 'Uncomplicated Hypertension',
       'Complicated Hypertension', 'Uncomplicated Diabetes',
       'Complicated Diabetes', 'Malignancy', 'Hematologic Disease',
       'Metastasis', 'Peripheral Vascular Disease', 'Hypothyroidism',
       'Chronic Heart Failure', 'Stroke', 'Liver Disease', 'SAPS II', 'SOFA',
       'OASIS', 'Sepsis', 'Any Organ Failure', 'Severe Respiratory Failure',
       'Severe Coagulation Failure', 'Severe Liver Failure',
       'Severe Cardiovascular Failure',
       'Severe Central Nervous System Failure', 'Severe Renal Failure',
       'Respiratory Dysfunction', 'Cardiovascular Dysfunction',
       'Renal Dysfunction', 'Hematologic Dysfunction', 'Metabolic Dysfunction',
       'Neurologic Dysfunction', 'Max Heart Rate', 'Min Heart Rate',
       'Mean Heart Rate', 'Max MAP', 'Min MAP', 'Mean MAP',
       'Max Systolic Pressure', 'Min Systolic Pressure',
       'Mean Systolic Pressure', 'Max Diastolic Pressure',
       'Min Diastolic Pressure', 'Mean Diastolic Pressure', 'Max Temperature',
       'Min Temperature', 'Mean Temperature', 'Max Lactate', 'Min Lactate',
       'Mean Lactate', 'Max pH', 'Min pH', 'Mean pH', 'Max Glucose',
       'Min Glucose', 'Mean Glucose', 'Max WBC', 'Min WBC', 'Mean WBC',
       'Max BUN', 'Min BUN', 'Mean BUN', 'Max Creatinine', 'Min Creatinine',
       'Mean Creatinine', 'Max Hemoglobin', 'Min Hemoglobin',
       'Mean Hemoglobin', 'Ventilation Duration (h)', 'RRT'],
      dtype='object')

Type of each variable

In [36]:
# Check that the data types are correct
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18883 entries, 0 to 18882
Data columns (total 70 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Hospital Mortality                     18883 non-null  int64  
 1   Age                                    18883 non-null  int64  
 2   Gender                                 18883 non-null  object 
 3   Uncomplicated Hypertension             18883 non-null  int64  
 4   Complicated Hypertension               18883 non-null  int64  
 5   Uncomplicated Diabetes                 18883 non-null  int64  
 6   Complicated Diabetes                   18883 non-null  int64  
 7   Malignancy                             18883 non-null  int64  
 8   Hematologic Disease                    18883 non-null  int64  
 9   Metastasis                             18883 non-null  int64  
 10  Peripheral Vascular Disease            18883 non-null  int64  
 11  Hypothyroidism                         18883 non-null  int64  
 12  Chronic Heart Failure                  18883 non-null  int64  
 13  Stroke                                 18883 non-null  int64  
 14  Liver Disease                          18883 non-null  int64  
 15  SAPS II                                18883 non-null  int64  
 16  SOFA                                   18883 non-null  int64  
 17  OASIS                                  18883 non-null  int64  
 18  Sepsis                                 18883 non-null  int64  
 19  Any Organ Failure                      18883 non-null  int64  
 20  Severe Respiratory Failure             18883 non-null  int64  
 21  Severe Coagulation Failure             18883 non-null  int64  
 22  Severe Liver Failure                   18883 non-null  int64  
 23  Severe Cardiovascular Failure          18883 non-null  int64  
 24  Severe Central Nervous System Failure  18883 non-null  int64  
 25  Severe Renal Failure                   18883 non-null  int64  
 26  Respiratory Dysfunction                18883 non-null  int64  
 27  Cardiovascular Dysfunction             18883 non-null  int64  
 28  Renal Dysfunction                      18883 non-null  int64  
 29  Hematologic Dysfunction                18883 non-null  int64  
 30  Metabolic Dysfunction                  18883 non-null  int64  
 31  Neurologic Dysfunction                 18883 non-null  int64  
 32  Max Heart Rate                         18842 non-null  float64
 33  Min Heart Rate                         18842 non-null  float64
 34  Mean Heart Rate                        18842 non-null  float64
 35  Max MAP                                18841 non-null  float64
 36  Min MAP                                18841 non-null  float64
 37  Mean MAP                               18841 non-null  float64
 38  Max Systolic Pressure                  18823 non-null  float64
 39  Min Systolic Pressure                  18823 non-null  float64
 40  Mean Systolic Pressure                 18823 non-null  float64
 41  Max Diastolic Pressure                 18822 non-null  float64
 42  Min Diastolic Pressure                 18822 non-null  float64
 43  Mean Diastolic Pressure                18822 non-null  float64
 44  Max Temperature                        18196 non-null  float64
 45  Min Temperature                        18196 non-null  float64
 46  Mean Temperature                       18196 non-null  float64
 47  Max Lactate                            13782 non-null  float64
 48  Min Lactate                            13782 non-null  float64
 49  Mean Lactate                           13782 non-null  float64
 50  Max pH                                 17710 non-null  float64
 51  Min pH                                 17710 non-null  float64
 52  Mean pH                                17710 non-null  float64
 53  Max Glucose                            18808 non-null  float64
 54  Min Glucose                            18808 non-null  float64
 55  Mean Glucose                           18808 non-null  float64
 56  Max WBC                                18654 non-null  float64
 57  Min WBC                                18654 non-null  float64
 58  Mean WBC                               18654 non-null  float64
 59  Max BUN                                18788 non-null  float64
 60  Min BUN                                18788 non-null  float64
 61  Mean BUN                               18788 non-null  float64
 62  Max Creatinine                         18788 non-null  float64
 63  Min Creatinine                         18788 non-null  float64
 64  Mean Creatinine                        18788 non-null  float64
 65  Max Hemoglobin                         18792 non-null  float64
 66  Min Hemoglobin                         18792 non-null  float64
 67  Mean Hemoglobin                        18792 non-null  float64
 68  Ventilation Duration (h)               18386 non-null  float64
 69  RRT                                    18883 non-null  int64  
dtypes: float64(37), int64(32), object(1)
memory usage: 10.1+ MB
In [37]:
df_schema.set_index('variable_name', inplace=True)
df_schema
Out[37]:
variable_name                          category            variable_type
Hospital Mortality                     Target              binary
Age                                    Demographic         continuous
Gender                                 Demographic         binary
Uncomplicated Hypertension             Medical history     binary
Complicated Hypertension               Medical history     binary
Uncomplicated Diabetes                 Medical history     binary
Complicated Diabetes                   Medical history     binary
Malignancy                             Medical history     binary
Hematologic Disease                    Medical history     binary
Metastasis                             Medical history     binary
Peripheral Vascular Disease            Medical history     binary
Hypothyroidism                         Medical history     binary
Chronic Heart Failure                  Medical history     binary
Stroke                                 Medical history     binary
Liver Disease                          Medical history     binary
SAPS II                                Disease severity    ordinal
SOFA                                   Disease severity    ordinal
OASIS                                  Disease severity    ordinal
Sepsis                                 Diagnosis           binary
Any Organ Failure                      Diagnosis           binary
Severe Respiratory Failure             Diagnosis           binary
Severe Coagulation Failure             Diagnosis           binary
Severe Liver Failure                   Diagnosis           binary
Severe Cardiovascular Failure          Diagnosis           binary
Severe Central Nervous System Failure  Diagnosis           binary
Severe Renal Failure                   Diagnosis           binary
Respiratory Dysfunction                Diagnosis           binary
Cardiovascular Dysfunction             Diagnosis           binary
Renal Dysfunction                      Diagnosis           binary
Hematologic Dysfunction                Diagnosis           binary
Metabolic Dysfunction                  Diagnosis           binary
Neurologic Dysfunction                 Diagnosis           binary
Max Heart Rate                         Vital signs         continuous
Min Heart Rate                         Vital signs         continuous
Mean Heart Rate                        Vital signs         continuous
Max MAP                                Vital signs         continuous
Min MAP                                Vital signs         continuous
Mean MAP                               Vital signs         continuous
Max Systolic Pressure                  Vital signs         continuous
Min Systolic Pressure                  Vital signs         continuous
Mean Systolic Pressure                 Vital signs         continuous
Max Diastolic Pressure                 Vital signs         continuous
Min Diastolic Pressure                 Vital signs         continuous
Mean Diastolic Pressure                Vital signs         continuous
Max Temperature                        Vital signs         continuous
Min Temperature                        Vital signs         continuous
Mean Temperature                       Vital signs         continuous
Max Lactate                            Laboratory results  continuous
Min Lactate                            Laboratory results  continuous
Mean Lactate                           Laboratory results  continuous
Max pH                                 Laboratory results  continuous
Min pH                                 Laboratory results  continuous
Mean pH                                Laboratory results  continuous
Max Glucose                            Laboratory results  continuous
Min Glucose                            Laboratory results  continuous
Mean Glucose                           Laboratory results  continuous
Max WBC                                Laboratory results  continuous
Min WBC                                Laboratory results  continuous
Mean WBC                               Laboratory results  continuous
Max BUN                                Laboratory results  continuous
Min BUN                                Laboratory results  continuous
Mean BUN                               Laboratory results  continuous
Max Creatinine                         Laboratory results  continuous
Min Creatinine                         Laboratory results  continuous
Mean Creatinine                        Laboratory results  continuous
Max Hemoglobin                         Laboratory results  continuous
Min Hemoglobin                         Laboratory results  continuous
Mean Hemoglobin                        Laboratory results  continuous
Ventilation Duration (h)               Treatment           continuous
RRT                                    Treatment           binary
Schema Visualizations
In [59]:
# Calculate the percentages
df_percentages = df_schema['category'].value_counts(normalize=True) * 100

# Plot the percentages
df_percentages.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.title('Percentage of Variables in Each Category _ 67 Features')
plt.gca().spines[['top', 'right']].set_visible(False)
[Figure: horizontal bar chart, "Percentage of Variables in Each Category _ 67 Features"]
In [60]:
# Calculate the percentages
df_percentages = df_schema['variable_type'].value_counts(normalize=True) * 100

# Plot the percentages
df_percentages.plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.title('Percentage of Variables in Each Type _ 67 Features')
plt.gca().spines[['top', 'right']].set_visible(False)
[Figure: horizontal bar chart, "Percentage of Variables in Each Type _ 67 Features"]

Unique Values

The number of unique values per variable is shown below.

In [40]:
# number of unique values per variable
df.nunique()
Out[40]:
Hospital Mortality                           2
Age                                         72
Gender                                       2
Uncomplicated Hypertension                   2
Complicated Hypertension                     2
Uncomplicated Diabetes                       2
Complicated Diabetes                         2
Malignancy                                   2
Hematologic Disease                          2
Metastasis                                   2
Peripheral Vascular Disease                  2
Hypothyroidism                               2
Chronic Heart Failure                        2
Stroke                                       2
Liver Disease                                2
SAPS II                                    107
SOFA                                        23
OASIS                                       64
Sepsis                                       2
Any Organ Failure                            2
Severe Respiratory Failure                   2
Severe Coagulation Failure                   2
Severe Liver Failure                         2
Severe Cardiovascular Failure                2
Severe Central Nervous System Failure        2
Severe Renal Failure                         2
Respiratory Dysfunction                      2
Cardiovascular Dysfunction                   2
Renal Dysfunction                            2
Hematologic Dysfunction                      2
Metabolic Dysfunction                        2
Neurologic Dysfunction                       2
Max Heart Rate                             162
Min Heart Rate                             132
Mean Heart Rate                          12730
Max MAP                                    426
Min MAP                                    258
Mean MAP                                 14561
Max Systolic Pressure                      205
Min Systolic Pressure                      167
Mean Systolic Pressure                   13300
Max Diastolic Pressure                     179
Min Diastolic Pressure                      92
Mean Diastolic Pressure                  11778
Max Temperature                            334
Min Temperature                            369
Mean Temperature                         11507
Max Lactate                                239
Min Lactate                                182
Mean Lactate                               946
Max pH                                      89
Min pH                                      95
Mean pH                                     78
Max Glucose                                573
Min Glucose                                332
Mean Glucose                              4984
Max WBC                                    553
Min WBC                                    445
Mean WBC                                  1974
Max BUN                                    167
Min BUN                                    151
Mean BUN                                  1077
Max Creatinine                             144
Min Creatinine                             123
Mean Creatinine                            606
Max Hemoglobin                             143
Min Hemoglobin                             159
Mean Hemoglobin                            934
Ventilation Duration (h)                  5895
RRT                                          2
dtype: int64
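
The unique-value counts can be cross-checked against the schema: every variable that the schema labels as binary should show exactly two distinct values, as is the case above. The sketch below illustrates one way to automate that check. It runs on a small made-up frame, and `toy_df`, `toy_schema`, and `check_binary` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for df and df_schema (illustrative values only).
toy_df = pd.DataFrame({
    "Hospital Mortality": [0, 1, 0, 1],
    "Gender": ["M", "F", "M", "M"],
    "Age": [64, 77, 51, 80],
})
toy_schema = pd.DataFrame(
    {"variable_type": ["binary", "binary", "continuous"]},
    index=["Hospital Mortality", "Gender", "Age"],
)

def check_binary(data, schema):
    """Return the binary-typed columns that do not have exactly two unique values."""
    binary_cols = schema.index[schema["variable_type"] == "binary"]
    n_unique = data[binary_cols].nunique()
    return n_unique[n_unique != 2].index.tolist()

violations = check_binary(toy_df, toy_schema)
print(violations)
```

An empty list means the data and the schema agree on which variables are binary.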

Most frequent value

Most frequently occurring value per variable

In [41]:
# most frequently occurring value and its count
most_frequent_values = {}
for column in df.columns:
    counts = df[column].value_counts()  # computed once, NaN excluded
    most_common = counts.idxmax()
    count = counts.max()
    most_frequent_values[column] = {'value': most_common, 'count': count}

# DataFrame from the dictionary
result_df = pd.DataFrame(most_frequent_values).T

print(result_df)
                                           value   count
Hospital Mortality                             0   15866
Age                                           77     504
Gender                                         M   11457
Uncomplicated Hypertension                     0   10123
Complicated Hypertension                       0   17418
Uncomplicated Diabetes                         0   14980
Complicated Diabetes                           0   17924
Malignancy                                     0   16871
Hematologic Disease                            0   16070
Metastasis                                     0   18055
Peripheral Vascular Disease                    0   17244
Hypothyroidism                                 0   17307
Chronic Heart Failure                          0   14348
Stroke                                         0   17825
Liver Disease                                  0   17085
SAPS II                                       34     692
SOFA                                           4    2872
OASIS                                         35     972
Sepsis                                         0   16063
Any Organ Failure                              1    9574
Severe Respiratory Failure                     0   17663
Severe Coagulation Failure                     0   18783
Severe Liver Failure                           0   18663
Severe Cardiovascular Failure                  0   16581
Severe Central Nervous System Failure          0   17784
Severe Renal Failure                           0   17965
Respiratory Dysfunction                        0   14003
Cardiovascular Dysfunction                     0   16329
Renal Dysfunction                              0   14244
Hematologic Dysfunction                        0   16858
Metabolic Dysfunction                          0   17010
Neurologic Dysfunction                         0   17190
Max Heart Rate                              88.0   534.0
Min Heart Rate                              70.0   647.0
Mean Heart Rate                             87.0    26.0
Max MAP                                     93.0   484.0
Min MAP                                     58.0   773.0
Mean MAP                                    74.0    26.0
Max Systolic Pressure                      150.0   396.0
Min Systolic Pressure                       85.0   601.0
Mean Systolic Pressure                     108.0    26.0
Max Diastolic Pressure                      80.0   572.0
Min Diastolic Pressure                      45.0   863.0
Mean Diastolic Pressure                     60.0    37.0
Max Temperature                             37.5   645.0
Min Temperature                        36.111111   607.0
Mean Temperature                       36.944444    51.0
Max Lactate                                  2.0   461.0
Min Lactate                                  1.0   976.0
Mean Lactate                                 1.4   366.0
Max pH                                      7.44  1224.0
Min pH                                      7.32   963.0
Mean pH                                     7.38  1421.0
Max Glucose                                165.0   177.0
Min Glucose                                 99.0   343.0
Mean Glucose                               130.0    76.0
Max WBC                                     12.3   162.0
Min WBC                                     10.2   193.0
Mean WBC                                     9.7    94.0
Max BUN                                     15.0  1041.0
Min BUN                                     13.0  1128.0
Mean BUN                                    14.0   544.0
Max Creatinine                               0.8  2237.0
Min Creatinine                               0.7  2678.0
Mean Creatinine                              0.8  1211.0
Max Hemoglobin                              12.7   403.0
Min Hemoglobin                               9.4   369.0
Mean Hemoglobin                              9.7   138.0
Ventilation Duration (h)                     4.0   354.0
RRT                                            0   18328
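
A raw count is hard to compare across columns with different numbers of non-missing values, so it can help to report the modal value's share alongside it. The following is a minimal sketch on made-up data, not the notebook's own method; `toy_df` and `most_frequent` are hypothetical names.

```python
import pandas as pd

# Toy data (illustrative values only).
toy_df = pd.DataFrame({
    "Gender": ["M", "M", "F", "M"],
    "SOFA": [4, 4, 2, 7],
})

def most_frequent(data):
    """Return the modal value, its count, and its share of non-missing rows."""
    rows = {}
    for column in data.columns:
        counts = data[column].value_counts()  # sorted by frequency, NaN excluded
        non_missing = data[column].notna().sum()
        rows[column] = {
            "value": counts.index[0],
            "count": int(counts.iloc[0]),
            "share (%)": round(100 * counts.iloc[0] / non_missing, 2),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

summary = most_frequent(toy_df)
print(summary)
```

Casting the count to `int` also avoids the float formatting (for example `534.0`) seen in the printed table above.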

Data Set Modification

Find missing values

Number of missing values by column
In [42]:
# missing values column-wise
na_count = df.isnull().sum() # total count
na_pct = (na_count/len(df))*100 # percentage

na_df = pd.DataFrame({'Count': na_count.values,
                      'Percentage (%)': na_pct}).reset_index().rename(columns = {'index': 'Feature'})

na_df.sort_values(by='Percentage (%)', ascending=False)
Out[42]:
    Feature                                Count  Percentage (%)
49  Mean Lactate                            5101  27.013716
48  Min Lactate                             5101  27.013716
47  Max Lactate                             5101  27.013716
50  Max pH                                  1173   6.211937
51  Min pH                                  1173   6.211937
52  Mean pH                                 1173   6.211937
44  Max Temperature                          687   3.638193
45  Min Temperature                          687   3.638193
46  Mean Temperature                         687   3.638193
68  Ventilation Duration (h)                 497   2.631997
58  Mean WBC                                 229   1.212731
57  Min WBC                                  229   1.212731
56  Max WBC                                  229   1.212731
64  Mean Creatinine                           95   0.503098
63  Min Creatinine                            95   0.503098
62  Max Creatinine                            95   0.503098
61  Mean BUN                                  95   0.503098
60  Min BUN                                   95   0.503098
59  Max BUN                                   95   0.503098
65  Max Hemoglobin                            91   0.481915
66  Min Hemoglobin                            91   0.481915
67  Mean Hemoglobin                           91   0.481915
55  Mean Glucose                              75   0.397183
53  Max Glucose                               75   0.397183
54  Min Glucose                               75   0.397183
43  Mean Diastolic Pressure                   61   0.323042
42  Min Diastolic Pressure                    61   0.323042
41  Max Diastolic Pressure                    61   0.323042
38  Max Systolic Pressure                     60   0.317746
39  Min Systolic Pressure                     60   0.317746
40  Mean Systolic Pressure                    60   0.317746
35  Max MAP                                   42   0.222422
37  Mean MAP                                  42   0.222422
36  Min MAP                                   42   0.222422
32  Max Heart Rate                            41   0.217127
33  Min Heart Rate                            41   0.217127
34  Mean Heart Rate                           41   0.217127
0   Hospital Mortality                         0   0.000000
1   Age                                        0   0.000000
31  Neurologic Dysfunction                     0   0.000000
2   Gender                                     0   0.000000
3   Uncomplicated Hypertension                 0   0.000000
4   Complicated Hypertension                   0   0.000000
5   Uncomplicated Diabetes                     0   0.000000
6   Complicated Diabetes                       0   0.000000
7   Malignancy                                 0   0.000000
8   Hematologic Disease                        0   0.000000
9   Metastasis                                 0   0.000000
10  Peripheral Vascular Disease                0   0.000000
11  Hypothyroidism                             0   0.000000
12  Chronic Heart Failure                      0   0.000000
13  Stroke                                     0   0.000000
14  Liver Disease                              0   0.000000
15  SAPS II                                    0   0.000000
16  SOFA                                       0   0.000000
17  OASIS                                      0   0.000000
18  Sepsis                                     0   0.000000
19  Any Organ Failure                          0   0.000000
20  Severe Respiratory Failure                 0   0.000000
21  Severe Coagulation Failure                 0   0.000000
22  Severe Liver Failure                       0   0.000000
23  Severe Cardiovascular Failure              0   0.000000
24  Severe Central Nervous System Failure      0   0.000000
25  Severe Renal Failure                       0   0.000000
26  Respiratory Dysfunction                    0   0.000000
27  Cardiovascular Dysfunction                 0   0.000000
28  Renal Dysfunction                          0   0.000000
29  Hematologic Dysfunction                    0   0.000000
30  Metabolic Dysfunction                      0   0.000000
69  RRT                                        0   0.000000
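
Each Max/Min/Mean triple reports the same missing count, which suggests (but does not by itself prove) that the three statistics of a measurement are missing in the same rows. One way to confirm this is sketched below on made-up data; `toy_df` and `missing_together` are hypothetical names, and the helper assumes the `Max X` / `Min X` / `Mean X` column naming used in this dataset.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only).
toy_df = pd.DataFrame({
    "Max Lactate": [2.1, np.nan, 1.4, np.nan],
    "Min Lactate": [1.0, np.nan, 0.9, np.nan],
    "Mean Lactate": [1.5, np.nan, 1.1, np.nan],
    "Max pH": [7.44, 7.40, np.nan, 7.38],
    "Min pH": [7.32, 7.30, np.nan, 7.31],
    "Mean pH": [7.38, 7.35, np.nan, 7.35],
})

def missing_together(data, measure):
    """True if Max/Min/Mean of `measure` are missing in exactly the same rows."""
    cols = [f"{stat} {measure}" for stat in ("Max", "Min", "Mean")]
    mask = data[cols].isna()
    # Every row must be either fully missing or fully present for this measure.
    return bool((mask.all(axis=1) | ~mask.any(axis=1)).all())

print(missing_together(toy_df, "Lactate"), missing_together(toy_df, "pH"))
```

If the check holds for every measurement, dropping rows on one column of a triple is equivalent to dropping on all three.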
Number of missing values by row
In [43]:
temp_df = df.copy()
# Calculate missing values by row
missing_values_by_row = df.isnull().sum(axis=1)

# Add the missing values count to the original DataFrame
temp_df["MissingValuesCount"] = missing_values_by_row

total_rows = len(temp_df)
temp_df["MissingValuesPercentage"] = (temp_df["MissingValuesCount"] / total_rows) * 100

# Sort the DataFrame by the "MissingValuesPercentage" column in descending order
df_sorted = temp_df.sort_values(by="MissingValuesPercentage", ascending=False)

# Print the sorted DataFrame
print(df_sorted[['MissingValuesCount','MissingValuesPercentage']].head(100))
       MissingValuesCount  MissingValuesPercentage
6766                   36                 0.190648
11565                  36                 0.190648
960                    36                 0.190648
6480                   36                 0.190648
8888                   27                 0.142986
8750                   25                 0.132394
1794                   24                 0.127098
4749                   24                 0.127098
4900                   24                 0.127098
13876                  24                 0.127098
13534                  24                 0.127098
10804                  24                 0.127098
4171                   24                 0.127098
1682                   22                 0.116507
4219                   22                 0.116507
1509                   22                 0.116507
8097                   22                 0.116507
11525                  22                 0.116507
15573                  22                 0.116507
12752                  21                 0.111211
18151                  21                 0.111211
18715                  21                 0.111211
12726                  21                 0.111211
17788                  21                 0.111211
12586                  21                 0.111211
12849                  21                 0.111211
3212                   21                 0.111211
8123                   21                 0.111211
10182                  21                 0.111211
18534                  21                 0.111211
1462                   21                 0.111211
2913                   21                 0.111211
13524                  21                 0.111211
5816                   21                 0.111211
17881                  21                 0.111211
4744                   21                 0.111211
1447                   21                 0.111211
15116                  21                 0.111211
14456                  21                 0.111211
10515                  21                 0.111211
16846                  21                 0.111211
18680                  21                 0.111211
6245                   21                 0.111211
16475                  21                 0.111211
13882                  21                 0.111211
3341                   21                 0.111211
11789                  21                 0.111211
15382                  21                 0.111211
9566                   21                 0.111211
7869                   21                 0.111211
14267                  21                 0.111211
6116                   21                 0.111211
6595                   21                 0.111211
2351                   21                 0.111211
18018                  21                 0.111211
11422                  21                 0.111211
4541                   21                 0.111211
3911                   21                 0.111211
10088                  21                 0.111211
8692                   19                 0.100620
7663                   19                 0.100620
17549                  18                 0.095324
10339                  18                 0.095324
4369                   18                 0.095324
13242                  18                 0.095324
1589                   18                 0.095324
8800                   18                 0.095324
2248                   18                 0.095324
9356                   18                 0.095324
14094                  18                 0.095324
3458                   18                 0.095324
947                    18                 0.095324
14802                  18                 0.095324
16359                  18                 0.095324
18190                  18                 0.095324
9073                   18                 0.095324
924                    16                 0.084732
10246                  16                 0.084732
11980                  15                 0.079437
12658                  15                 0.079437
3926                   15                 0.079437
16373                  15                 0.079437
17372                  15                 0.079437
13374                  15                 0.079437
6557                   15                 0.079437
3017                   15                 0.079437
3224                   15                 0.079437
9248                   15                 0.079437
18260                  15                 0.079437
16385                  15                 0.079437
341                    15                 0.079437
18001                  15                 0.079437
16029                  15                 0.079437
11143                  15                 0.079437
16852                  15                 0.079437
15831                  15                 0.079437
9668                   15                 0.079437
4584                   15                 0.079437
7438                   15                 0.079437
687                    15                 0.079437
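
In the cell above, `MissingValuesPercentage` divides each row's missing count by the number of rows (18,883), so a row with 36 missing values is reported as 0.19%. If the intended quantity is the share of a row's fields that are missing, the denominator would be the number of columns instead. A minimal sketch of that variant on made-up data (`toy_df` and `report` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy data with 4 columns (illustrative values only).
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
    "d": [1.0, np.nan, 2.0],
})

# Missing values per row, as a count and as a share of the columns.
row_missing = toy_df.isna().sum(axis=1)
row_missing_pct = row_missing / toy_df.shape[1] * 100

report = pd.DataFrame({
    "MissingValuesCount": row_missing,
    "MissingValuesPercentage": row_missing_pct,
}).sort_values("MissingValuesCount", ascending=False)
print(report)
```

Under this definition, 36 missing values out of 70 columns would correspond to roughly 51% of the row.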
Visualize the missing values
In [44]:
# heatmap
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
[Figure: heatmap of missing values in df (rows by columns)]

Removing Missing values

In [45]:
before_clear_mv = df.shape
print(f'Before clean missing values, the dataset has {before_clear_mv[0]} rows and {before_clear_mv[1]} variables')
Before clean missing values, the dataset has 18883 rows and 70 variables
In [46]:
# drop the columns RRT and Ventilation Duration (h) in df
df_orig = df.copy()
df = df.drop(['RRT', 'Ventilation Duration (h)'], axis=1)
In [47]:
def calculate_missing_percentages(df):
    df_copy = df.copy()  # Create a copy of the DataFrame
    missing_percentages = df_copy.isna().sum(axis=1) / len(df.columns) * 100
    missing_percentages = missing_percentages.round(2)

    # Create a new column to store the percentage of missing values
    df_copy['missing_percentage'] = missing_percentages

    # Group the DataFrame by the 'missing_percentage' column and count the number of observations in each group
    counts = df_copy.groupby('missing_percentage').size().reset_index(name='count')

    # Sort the table by percentage from highest to lowest
    counts = counts.sort_values(by='missing_percentage', ascending=False)

    # Print the table
    print(counts)
In [48]:
dem_f = df_schema[df_schema['category'] == 'Demographic'].index
medical_f = df_schema[df_schema['category'] == 'Medical history'].index
severe_f = df_schema[df_schema['category'] == 'Disease severity'].index
diag_f = df_schema[df_schema['category'] == 'Diagnosis'].index
vital_f = df_schema[df_schema['category'] == 'Vital signs'].index
lab_f = df_schema[df_schema['category'] == 'Laboratory results'].index

dem_df = df[dem_f]
medical_df = df[medical_f]
severe_df = df[severe_f]
diag_df = df[diag_f]
vital_df = df[vital_f]
lab_df = df[lab_f]
In [49]:
# calculate_missing_percentages prints its own table and returns None,
# so it is called directly rather than wrapped in print()
print('Demographic')
calculate_missing_percentages(dem_df)
print('-----------------------------')
print()
print('Medical history')
calculate_missing_percentages(medical_df)
print('-----------------------------')
print()
print('Disease severity')
calculate_missing_percentages(severe_df)
print('-----------------------------')
print()
print('Diagnosis')
calculate_missing_percentages(diag_df)
print('-----------------------------')
print()
print('Vital signs')
calculate_missing_percentages(vital_df)
print('-----------------------------')
print()
print('Laboratory results')
calculate_missing_percentages(lab_df)
Demographic
   missing_percentage  count
0                 0.0  18883
-----------------------------

Medical history
   missing_percentage  count
0                 0.0  18883
-----------------------------

Disease severity
   missing_percentage  count
0                 0.0  18883
-----------------------------

Diagnosis
   missing_percentage  count
0                 0.0  18883
-----------------------------

Vital signs
   missing_percentage  count
4               100.0     41
3                60.0      1
2                40.0     18
1                20.0    647
0                 0.0  18176
-----------------------------

Laboratory results
   missing_percentage  count
7              100.00     50
6               85.71      9
5               71.43      6
4               57.14     15
3               42.86     28
2               28.57    968
1               14.29   4345
0                0.00  13462
In [50]:
df.dropna(subset=['Max Heart Rate', 'Min Heart Rate', 'Mean Heart Rate',
                  'Max MAP', 'Min MAP', 'Mean MAP',
                  'Max Systolic Pressure', 'Min Systolic Pressure', 'Mean Systolic Pressure',
                  'Max Diastolic Pressure', 'Min Diastolic Pressure', 'Mean Diastolic Pressure',
                  'Max Temperature', 'Min Temperature', 'Mean Temperature'],
          inplace=True)
In [51]:
df.shape
Out[51]:
(18176, 68)
In [52]:
df.dropna(subset=['Max Lactate', 'Min Lactate', 'Mean Lactate',
                  'Max pH', 'Min pH', 'Mean pH',
                  'Max Glucose', 'Min Glucose', 'Mean Glucose',
                  'Max WBC', 'Min WBC', 'Mean WBC',
                  'Max BUN', 'Min BUN', 'Mean BUN',
                  'Max Creatinine', 'Min Creatinine', 'Mean Creatinine',
                  'Max Hemoglobin', 'Min Hemoglobin', 'Mean Hemoglobin'],
          inplace=True)
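The hard-coded column lists above could also be derived from the schema table, which already groups variables by category. The following is a minimal sketch on toy data, assuming a schema shaped like `df_schema` (variable names as the index, a `category` column); `toy_schema`, `toy`, and `required` are illustrative names and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy schema in the same shape as df_schema: variable names as index, a 'category' column
toy_schema = pd.DataFrame(
    {'category': ['Demographic', 'Vital signs', 'Vital signs', 'Laboratory results']},
    index=['Age', 'Max Heart Rate', 'Min Heart Rate', 'Max Lactate'],
)

# Toy data: row 1 lacks a vital sign, row 2 lacks a laboratory result
toy = pd.DataFrame({
    'Age': [70, 65, 80],
    'Max Heart Rate': [110.0, np.nan, 95.0],
    'Min Heart Rate': [60.0, 58.0, 62.0],
    'Max Lactate': [2.1, 1.8, np.nan],
})

# Columns whose missing values should cause a row to be dropped
required = toy_schema.index[toy_schema['category'].isin(['Vital signs', 'Laboratory results'])]

# Returns a new DataFrame instead of modifying toy in place
toy_clean = toy.dropna(subset=list(required))
print(toy_clean.shape)
```

Building the subset from the schema keeps the drop rule in step with the category definitions if columns are added or renamed.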
In [53]:
after_clean_missing_values = df.shape
After Clearing Missing Values¶
In [54]:
print(f'Before cleaning missing values, the dataset has {before_clear_mv[0]} rows and {before_clear_mv[1]} variables')
print(f'After cleaning missing values, the dataset has {after_clean_missing_values[0]} rows and {after_clean_missing_values[1]} variables')
Before cleaning missing values, the dataset has 18883 rows and 70 variables
After cleaning missing values, the dataset has 12799 rows and 68 variables
Visualization After Clearing Missing Values¶
In [55]:
# heatmap
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
[Figure: heatmap of df.isnull() after removing rows with missing values]

Valid Range¶

Keep only the rows in which every vital sign and laboratory value lies within its valid range; a row with any out-of-range value is dropped.

In [56]:
df = df.loc[
    ((df['Max Heart Rate'] >= 0) & (df['Max Heart Rate'] <= 350))
    & ((df['Min Heart Rate'] >= 0) & (df['Min Heart Rate'] <= 350))
    & ((df['Mean Heart Rate'] >= 0) & (df['Mean Heart Rate'] <= 350))
    & ((df['Max MAP']  >= 14) & (df['Max MAP'] <= 330))
    & ((df['Min MAP']  >= 14) & (df['Min MAP'] <= 330))
    & ((df['Mean MAP']  >= 14) & (df['Mean MAP'] <= 330))
    & ((df['Min Systolic Pressure'] >= 0) & (df['Min Systolic Pressure'] <= 375))
    & ((df['Max Systolic Pressure'] >= 0) & (df['Max Systolic Pressure'] <= 375))
    & ((df['Mean Systolic Pressure'] >= 0) & (df['Mean Systolic Pressure'] <= 375))
    & ((df['Min Diastolic Pressure'] >= 0) & (df['Min Diastolic Pressure'] <= 375))
    & ((df['Max Diastolic Pressure'] >= 0) & (df['Max Diastolic Pressure'] <= 375))
    & ((df['Mean Diastolic Pressure'] >= 0) & (df['Mean Diastolic Pressure'] <= 375))
    & ((df['Min Temperature'] >= 26)& (df['Min Temperature'] <= 45))
    & ((df['Max Temperature'] >= 26)& (df['Max Temperature'] <= 45))
    & ((df['Mean Temperature'] >= 26)& (df['Mean Temperature'] <= 45))
    & ((df['Min pH']  >= 0)& (df['Min pH'] <= 14))
    & ((df['Max pH']  >= 0)& (df['Max pH'] <= 14))
    & ((df['Mean pH']  >= 0)& (df['Mean pH'] <= 14))
    & ((df['Min Lactate']  >= 0.4)& (df['Min Lactate'] <= 30))
    & ((df['Max Lactate']  >= 0.4)& (df['Max Lactate'] <= 30))
    & ((df['Mean Lactate']  >= 0.4)& (df['Mean Lactate'] <= 30))
    & ((df['Min Glucose'] >= 33)& (df['Min Glucose'] <= 2000))
    & ((df['Max Glucose'] >= 33)& (df['Max Glucose'] <= 2000))
    & ((df['Mean Glucose'] >= 33)& (df['Mean Glucose'] <= 2000))
    & ((df['Min WBC'] >= 0)& (df['Min WBC'] <= 1000))
    & ((df['Max WBC'] >= 0)& (df['Max WBC'] <= 1000))
    & ((df['Mean WBC'] >= 0)& (df['Mean WBC'] <= 1000))
    & ((df['Min BUN'] >= 0)& (df['Min BUN'] <= 250))
    & ((df['Max BUN'] >= 0)& (df['Max BUN'] <= 250))
    & ((df['Mean BUN'] >= 0)& (df['Mean BUN'] <= 250))
    & ((df['Min Creatinine'] >= 0.1)& (df['Min Creatinine'] <= 60))
    & ((df['Max Creatinine'] >= 0.1)& (df['Max Creatinine'] <= 60))
    & ((df['Mean Creatinine'] >= 0.1)& (df['Mean Creatinine'] <= 60))
    & ((df['Min Hemoglobin'] >= 0) & (df['Min Hemoglobin'] <= 25))
    & ((df['Max Hemoglobin'] >= 0) & (df['Max Hemoglobin'] <= 25))
    & ((df['Mean Hemoglobin'] >= 0) & (df['Mean Hemoglobin'] <= 25))
]
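The same filter can be driven by a table of bounds, which keeps each limit in one place and avoids copy-paste slips between the Min, Max, and Mean columns. This is a minimal sketch on toy data, assuming the bounds used above; `VALID_RANGES` and `filter_valid_ranges` are illustrative names, and only two measurements are listed for brevity.

```python
import pandas as pd

# Hypothetical lookup of bounds per measurement (values copied from the filter above)
VALID_RANGES = {
    'Heart Rate': (0, 350),
    'Temperature': (26, 45),
}

def filter_valid_ranges(data, ranges=VALID_RANGES, prefixes=('Min', 'Max', 'Mean')):
    """Keep rows where every '<prefix> <measurement>' column lies within its bounds."""
    mask = pd.Series(True, index=data.index)
    for measurement, (low, high) in ranges.items():
        for prefix in prefixes:
            column = f'{prefix} {measurement}'
            # between() is inclusive on both ends, matching the >= and <= checks
            mask &= data[column].between(low, high)
    return data.loc[mask]

# Toy data: row 1 has an impossible Min Heart Rate, row 2 an impossible Min Temperature
toy = pd.DataFrame({
    'Min Heart Rate': [60, 400, 55],
    'Max Heart Rate': [120, 130, 110],
    'Mean Heart Rate': [80, 90, 75],
    'Min Temperature': [36.0, 36.5, 20.0],
    'Max Temperature': [38.0, 37.5, 37.0],
    'Mean Temperature': [37.0, 37.0, 36.0],
})
filtered_toy = filter_valid_ranges(toy)
print(filtered_toy.index.tolist())
```

As with the explicit comparisons, a missing value fails the range check, so rows with NaN in any checked column are removed too.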
In [57]:
print(df['Hospital Mortality'].value_counts())
Hospital Mortality
0    10331
1     2158
Name: count, dtype: int64
In [58]:
df.to_csv("/content/data_new.csv")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[58], line 1
----> 1 df.to_csv("/content/data_new.csv")

OSError: Cannot save file into a non-existent directory: '\content'
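The save fails because the target directory does not exist on this machine; `/content` is most likely a leftover from running the notebook on Google Colab, where it is the default working directory. One way to make the cell portable is to create the parent directory before writing. The sketch below uses a temporary directory and a hypothetical helper, `save_csv`, which is not part of the notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd

def save_csv(data, path):
    """Write `data` to `path`, creating any missing parent directories first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    data.to_csv(path)
    return path

# Toy example: the 'output' subdirectory does not exist before the call
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_csv(pd.DataFrame({'a': [1, 2]}), Path(tmp_dir) / 'output' / 'data_new.csv')
    saved_exists = saved.exists()
print(saved_exists)
```

A relative path such as `data_new.csv` would also work anywhere, at the cost of writing into whatever the current working directory happens to be.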

Create subsets¶

In [ ]:
cont_dem = df_schema[(df_schema['variable_type'] == 'continuous') & (df_schema['category'] == 'Demographic')].index
cont_vital = df_schema[(df_schema['variable_type'] == 'continuous') & (df_schema['category'] == 'Vital signs')].index
cont_lab = df_schema[
    (df_schema['variable_type'] == 'continuous')
    & (df_schema['category'] == 'Laboratory results')
    & df_schema.index.str.contains('Lactate|Crea|Hemog')
].index
cont_lab2 = df_schema[
    (df_schema['variable_type'] == 'continuous')
    & (df_schema['category'] == 'Laboratory results')
    & df_schema.index.str.contains('Glucose|WBC|BUN|pH')
].index
binary = df_schema[((df_schema['variable_type']=='binary') | (df_schema.index == 'Hospital Mortality')) & (df_schema.index != 'RRT')].index
ordinal = df_schema[((df_schema['variable_type'] == 'ordinal') | (df_schema.index == 'Hospital Mortality'))].index
cont_all = df_schema[((df_schema['variable_type'] == 'continuous') | (df_schema.index == 'Hospital Mortality')) & (df_schema['category'] != 'Treatment')].index
In [ ]:
cont_dem_df = df[cont_dem]
cont_vital_df = df[cont_vital]
cont_lab_df = df[cont_lab]
cont_lab2_df = df[cont_lab2]
cont_all_df = df[cont_all]
binary_df = df[binary]
ordinal_df = df[ordinal]

Categorical Variables¶

Distinct Values in Categorical Variables
In [ ]:
cat_feats = ['Hospital Mortality', 'Gender', 'Uncomplicated Hypertension',
       'Complicated Hypertension', 'Uncomplicated Diabetes',
       'Complicated Diabetes', 'Malignancy', 'Hematologic Disease',
       'Metastasis', 'Peripheral Vascular Disease', 'Hypothyroidism',
       'Chronic Heart Failure', 'Stroke', 'Liver Disease',
        'Sepsis', 'Any Organ Failure', 'Severe Respiratory Failure',
       'Severe Coagulation Failure', 'Severe Liver Failure',
       'Severe Cardiovascular Failure',
       'Severe Central Nervous System Failure', 'Severe Renal Failure',
       'Respiratory Dysfunction', 'Cardiovascular Dysfunction',
       'Renal Dysfunction', 'Hematologic Dysfunction', 'Metabolic Dysfunction',
       'Neurologic Dysfunction']
categorical_stats = df[cat_feats].apply(lambda x: x.nunique())
categorical_stats
Out[ ]:
Hospital Mortality                       2
Gender                                   2
Uncomplicated Hypertension               2
Complicated Hypertension                 2
Uncomplicated Diabetes                   2
Complicated Diabetes                     2
Malignancy                               2
Hematologic Disease                      2
Metastasis                               2
Peripheral Vascular Disease              2
Hypothyroidism                           2
Chronic Heart Failure                    2
Stroke                                   2
Liver Disease                            2
Sepsis                                   2
Any Organ Failure                        2
Severe Respiratory Failure               2
Severe Coagulation Failure               2
Severe Liver Failure                     2
Severe Cardiovascular Failure            2
Severe Central Nervous System Failure    2
Severe Renal Failure                     2
Respiratory Dysfunction                  2
Cardiovascular Dysfunction               2
Renal Dysfunction                        2
Hematologic Dysfunction                  2
Metabolic Dysfunction                    2
Neurologic Dysfunction                   2
dtype: int64
In [ ]:
print(f'There are {len(cat_feats)} categorical variables in total')
There are 28 categorical variables in total
Levels of Categorical Variables
In [ ]:
# Transposed so that each variable is a row and each level a column
value_counts_all = df[cat_feats].apply(pd.Series.value_counts).T
value_counts_all
Out[ ]:
                                               0       1       F       M
Hospital Mortality                       10331.0  2158.0     NaN     NaN
Gender                                       NaN     NaN  4891.0  7598.0
Uncomplicated Hypertension                7001.0  5488.0     NaN     NaN
Complicated Hypertension                 11427.0  1062.0     NaN     NaN
Uncomplicated Diabetes                   10013.0  2476.0     NaN     NaN
Complicated Diabetes                     11821.0   668.0     NaN     NaN
Malignancy                               11021.0  1468.0     NaN     NaN
Hematologic Disease                      10430.0  2059.0     NaN     NaN
Metastasis                               11876.0   613.0     NaN     NaN
Peripheral Vascular Disease              11448.0  1041.0     NaN     NaN
Hypothyroidism                           11428.0  1061.0     NaN     NaN
Chronic Heart Failure                     9504.0  2985.0     NaN     NaN
Stroke                                   11795.0   694.0     NaN     NaN
Liver Disease                            11151.0  1338.0     NaN     NaN
Sepsis                                   10228.0  2261.0     NaN     NaN
Any Organ Failure                         5462.0  7027.0     NaN     NaN
Severe Respiratory Failure               11493.0   996.0     NaN     NaN
Severe Coagulation Failure               12408.0    81.0     NaN     NaN
Severe Liver Failure                     12325.0   164.0     NaN     NaN
Severe Cardiovascular Failure            10650.0  1839.0     NaN     NaN
Severe Central Nervous System Failure    11753.0   736.0     NaN     NaN
Severe Renal Failure                     11803.0   686.0     NaN     NaN
Respiratory Dysfunction                   8811.0  3678.0     NaN     NaN
Cardiovascular Dysfunction               10485.0  2004.0     NaN     NaN
Renal Dysfunction                         8963.0  3526.0     NaN     NaN
Hematologic Dysfunction                  10995.0  1494.0     NaN     NaN
Metabolic Dysfunction                    10963.0  1526.0     NaN     NaN
Neurologic Dysfunction                   11245.0  1244.0     NaN     NaN
In [ ]:
df_cat = df[cat_feats]

# Count occurrences of 0 and 1 in each column
# (Gender is coded 'F'/'M', so both of its counts, and hence its percentages, are 0)
counts_0 = df_cat.apply(lambda x: x.isin([0]).sum())
counts_1 = df_cat.apply(lambda x: x.isin([1]).sum())

# Calculate percentage
percentages_0 = (counts_0 / len(df)) * 100
percentages_1 = (counts_1 / len(df)) * 100

perc_table = pd.DataFrame({'%0': percentages_0, '%1': percentages_1})

print(perc_table)
                                              %0         %1
Hospital Mortality                     82.720794  17.279206
Gender                                  0.000000   0.000000
Uncomplicated Hypertension             56.057330  43.942670
Complicated Hypertension               91.496517   8.503483
Uncomplicated Diabetes                 80.174554  19.825446
Complicated Diabetes                   94.651293   5.348707
Malignancy                             88.245656  11.754344
Hematologic Disease                    83.513492  16.486508
Metastasis                             95.091681   4.908319
Peripheral Vascular Disease            91.664665   8.335335
Hypothyroidism                         91.504524   8.495476
Chronic Heart Failure                  76.098967  23.901033
Stroke                                 94.443110   5.556890
Liver Disease                          89.286572  10.713428
Sepsis                                 81.896069  18.103931
Any Organ Failure                      43.734486  56.265514
Severe Respiratory Failure             92.024982   7.975018
Severe Coagulation Failure             99.351429   0.648571
Severe Liver Failure                   98.686844   1.313156
Severe Cardiovascular Failure          85.275042  14.724958
Severe Central Nervous System Failure  94.106814   5.893186
Severe Renal Failure                   94.507166   5.492834
Respiratory Dysfunction                70.550084  29.449916
Cardiovascular Dysfunction             83.953879  16.046121
Renal Dysfunction                      71.767155  28.232845
Hematologic Dysfunction                88.037473  11.962527
Metabolic Dysfunction                  87.781247  12.218753
Neurologic Dysfunction                 90.039235   9.960765
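A level-agnostic version of this table would also report variables that are not coded 0/1, such as Gender. The sketch below uses `value_counts(normalize=True)` on toy data; `level_percentages` is an illustrative helper name, and levels are converted to strings so that numeric and text codes can share one set of columns.

```python
import pandas as pd

def level_percentages(data, columns):
    """Return the percentage of rows at each level, one row per variable."""
    shares = {
        col: data[col].astype(str).value_counts(normalize=True) * 100
        for col in columns
    }
    # Levels that a variable does not use are left as NaN
    return pd.DataFrame(shares).T

# Toy data: one 0/1 variable and one 'F'/'M' variable
toy = pd.DataFrame({
    'Hospital Mortality': [0, 0, 0, 1],
    'Gender': ['F', 'M', 'M', 'M'],
})
toy_percentages = level_percentages(toy, ['Hospital Mortality', 'Gender'])
print(toy_percentages)
```

Because `value_counts` ignores missing values by default, the percentages here are relative to the non-missing rows of each variable rather than to `len(df)`.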
How the behavior of another variable X varies over the levels of C
In [ ]:
# Continuous columns to plot, excluding the target (no point plotting it against itself)
columns_to_visualize = [col for col in cont_all_df.columns if col != 'Hospital Mortality']

# One density plot per column: survivors, non-survivors, and all patients combined
for column in columns_to_visualize:
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.kdeplot(data=df[df['Hospital Mortality'] == 0][column], ax=ax, label='Alive', fill=True, color='g')
    sns.kdeplot(data=df[df['Hospital Mortality'] == 1][column], ax=ax, label='Dead', fill=True, color='r')
    sns.kdeplot(data=df[column], ax=ax, label='Overall Classes', fill=True, color='b')

    plt.title(f'Density Plot of {column} by Hospital Mortality')
    plt.xlabel(column)
    plt.ylabel('Density')
    plt.legend(title='Target')
    plt.show()
[Figures: 37 density plots, one per continuous variable, each titled "Density Plot of <variable> by Hospital Mortality" (x-axis: the variable, y-axis: Density; series: Alive, Dead, Overall Classes)]

Numerical Variables¶

Summary for Continuous Variables
In [ ]:
# Numerical features to summarize (vital signs, laboratory results, and severity scores)
numeric_features =  ['Max Heart Rate', 'Min Heart Rate',
       'Mean Heart Rate', 'Max MAP', 'Min MAP', 'Mean MAP',
       'Max Systolic Pressure', 'Min Systolic Pressure',
       'Mean Systolic Pressure', 'Max Diastolic Pressure',
       'Min Diastolic Pressure', 'Mean Diastolic Pressure', 'Max Temperature',
       'Min Temperature', 'Mean Temperature', 'Max Lactate', 'Min Lactate',
       'Mean Lactate', 'Max pH', 'Min pH', 'Mean pH', 'Max Glucose',
       'Min Glucose', 'Mean Glucose', 'Max WBC', 'Min WBC', 'Mean WBC',
       'Max BUN', 'Min BUN', 'Mean BUN', 'Max Creatinine', 'Min Creatinine',
       'Mean Creatinine', 'Max Hemoglobin', 'Min Hemoglobin',
       'Mean Hemoglobin', 'SAPS II', 'SOFA',
       'OASIS']
print(len(numeric_features))
39
In [ ]:
subset_df = df[numeric_features]

# Get descriptive statistics for the selected features
statistics = subset_df.describe()

mean = statistics.loc['mean']
median = statistics.loc['50%']  # Median is the 50th percentile
std_dev = statistics.loc['std']

# Create a DataFrame to display the statistics in a table format
statistics_table = pd.DataFrame({'Mean': mean, 'Median': median, 'Standard Deviation': std_dev})

print(statistics_table)
                               Mean      Median  Standard Deviation
Max Heart Rate           107.984474  106.000000           20.573834
Min Heart Rate            71.973765   71.000000           15.531557
Mean Heart Rate           88.430129   86.862069           15.372059
Max MAP                  109.473430  103.000000           30.144985
Min MAP                   57.380068   58.000000           12.192861
Mean MAP                  78.237488   76.725000           10.222848
Max Systolic Pressure    152.713268  149.000000           24.243929
Min Systolic Pressure     87.046799   87.000000           17.442935
Mean Systolic Pressure   117.109939  114.826087           15.521249
Max Diastolic Pressure    83.701257   81.000000           18.191668
Min Diastolic Pressure    43.468572   44.000000           10.873250
Mean Diastolic Pressure   60.474160   59.657895            9.743214
Max Temperature           37.749812   37.722222            0.844661
Min Temperature           36.030913   36.111111            0.969213
Mean Temperature          36.962726   36.966667            0.724690
Max Lactate                3.351364    2.500000            2.793236
Min Lactate                1.809492    1.400000            1.397299
Mean Lactate               2.525750    2.000000            1.890930
Max pH                     7.431130    7.440000            0.072189
Min pH                     7.289433    7.300000            0.107355
Mean pH                    7.364707    7.370000            0.070655
Max Glucose              192.261766  173.000000           88.173353
Min Glucose              109.329090  103.000000           36.344911
Mean Glucose             147.195135  136.000000           47.835281
Max WBC                   15.648373   14.200000           11.421410
Min WBC                   10.962118   10.000000            8.138209
Mean WBC                  13.227665   12.100000            9.482562
Max BUN                   25.997998   19.000000           20.530685
Min BUN                   21.464008   16.000000           17.646050
Mean BUN                  23.677328   17.500000           18.896375
Max Creatinine             1.440171    1.000000            1.404564
Min Creatinine             1.164977    0.800000            1.142817
Mean Creatinine            1.297446    0.930000            1.258200
Max Hemoglobin            12.418833   12.400000            1.961007
Min Hemoglobin             9.727840    9.600000            2.192672
Mean Hemoglobin           10.961597   10.730000            1.851872
SAPS II                   39.696933   38.000000           15.151403
SOFA                       5.412923    5.000000            3.419823
OASIS                     36.362399   36.000000            8.186966
Normality Check

Normality check for the continuous variables before normalization. The histograms and QQ plots show that outliers pull many of the distributions into a skewed shape.
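The visual impression can be backed with two simple numbers per variable: the sample skewness and the count of values outside the usual 1.5 × IQR fences. This is a minimal sketch on toy data; `skew_and_outliers` is an illustrative helper, and the 1.5 × IQR rule is an assumed outlier definition rather than one stated in the notebook.

```python
import pandas as pd

def skew_and_outliers(data):
    """Return sample skewness and the number of 1.5*IQR outliers per column."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    # Comparisons between a DataFrame and a Series align on column names
    outside = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
    return pd.DataFrame({'skewness': data.skew(), 'n_outliers': outside.sum()})

# Toy data: one symmetric column and one with a single very large value
toy = pd.DataFrame({
    'symmetric': [1.0, 2.0, 3.0, 4.0, 5.0],
    'right_skewed': [1.0, 1.0, 2.0, 2.0, 50.0],
})
toy_summary = skew_and_outliers(toy)
print(toy_summary)
```

On the real data, something like `skew_and_outliers(cont_all_df.drop(columns='Hospital Mortality'))` would give one row per continuous variable.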

In [ ]:
import statsmodels.api as sm  # provides qqplot; harmless if already imported earlier

# Histogram and QQ plot for each column
def plot_histogram_qqplot(data, column):
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))

    # Histogram with median and mean reference lines
    sns.histplot(data[column], kde=True, ax=axes[0])
    median_val = data[column].median()
    mean_val = data[column].mean()
    axes[0].axvline(median_val, color='r', linestyle='dashed', linewidth=2, label=f'Median: {median_val:.2f}')
    axes[0].axvline(mean_val, color='g', linestyle='dashed', linewidth=2, label=f'Mean: {mean_val:.2f}')
    axes[0].set_title(f'Histogram - {column}')
    axes[0].legend()

    # QQ Plot
    sm.qqplot(data[column], line='s', ax=axes[1])
    axes[1].set_title(f'QQ Plot - {column}')

    plt.show()

# Iterate through columns and create plots
for column in cont_all_df.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(cont_all_df, column)
[Figures: 37 two-panel plots, one per continuous variable; left panel "Histogram - <variable>" with dashed median and mean lines, right panel "QQ Plot - <variable>"]
Shapiro-Wilk Test

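The Shapiro-Wilk test was run on the dataset after rows with missing values were removed. At the 0.05 significance level, normality is rejected for every variable. With more than 12,000 rows, even small departures from normality produce very small p-values, so the W statistic and the QQ plots above are the better guide to how far each variable departs from normal.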
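SciPy's documentation notes that for samples larger than 5000 the W statistic is accurate but the p-value may not be. One possible workaround, which is an assumption here and not something the notebook does, is to test a fixed-seed random subsample of at most 5000 values. The helper name `shapiro_on_subsample` and the toy exponential data are illustrative.

```python
import numpy as np
from scipy import stats

def shapiro_on_subsample(values, max_n=5000, seed=0):
    """Run Shapiro-Wilk on at most `max_n` randomly chosen non-missing values."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]
    if len(values) > max_n:
        # Fixed seed so that the subsample, and hence the result, is reproducible
        rng = np.random.default_rng(seed)
        values = rng.choice(values, size=max_n, replace=False)
    statistic, p_value = stats.shapiro(values)
    return statistic, p_value, len(values)

# Toy data: a strongly right-skewed (exponential) sample of 12,000 values
toy_rng = np.random.default_rng(42)
skewed = toy_rng.exponential(scale=2.0, size=12_000)
w_stat, p_val, n_used = shapiro_on_subsample(skewed)
print(f"W = {w_stat:.3f}, p = {p_val:.3g}, n = {n_used}")
```

The subsample result depends on the seed, so it is best read together with the W statistic from the full data and the QQ plots rather than on its own.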

In [ ]:
from scipy import stats  # provides shapiro; harmless if already imported earlier

# Perform the Shapiro-Wilk test on every continuous column except the target
for column in cont_all_df.columns.drop('Hospital Mortality'):
    stat, p = stats.shapiro(cont_all_df[column])
    print(f"Shapiro-Wilk test for {column}:")
    print(f"  Statistic: {stat}")
    print(f"  p-value: {p}")
    print("")

    if p < 0.05:
        print(f"The distribution of {column} is significantly different from normal.")
    else:
        print(f"The distribution of {column} is not significantly different from normal.")
    print("---------------------------------------------------------------------")
Shapiro-Wilk test for Age:
  Statistic: 0.9579778909683228
  p-value: 0.0

The distribution of Age is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Heart Rate:
  Statistic: 0.9716181755065918
  p-value: 6.866362475191604e-44

The distribution of Max Heart Rate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Heart Rate:
  Statistic: 0.9904375672340393
  p-value: 4.8635451107712535e-28

The distribution of Min Heart Rate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Heart Rate:
  Statistic: 0.9883667230606079
  p-value: 1.270662467504408e-30

The distribution of Mean Heart Rate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max MAP:
  Statistic: 0.7134469747543335
  p-value: 0.0

The distribution of Max MAP is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min MAP:
  Statistic: 0.9794363975524902
  p-value: 8.243728163044217e-39

The distribution of Min MAP is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean MAP:
  Statistic: 0.9694177508354187
  p-value: 4.203895392974451e-45

The distribution of Mean MAP is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Systolic Pressure:
  Statistic: 0.9592243432998657
  p-value: 0.0

The distribution of Max Systolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Systolic Pressure:
  Statistic: 0.9820803999900818
  p-value: 9.711820751813725e-37

The distribution of Min Systolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Systolic Pressure:
  Statistic: 0.9709146022796631
  p-value: 2.6624670822171524e-44

The distribution of Mean Systolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Diastolic Pressure:
  Statistic: 0.8902511596679688
  p-value: 0.0

The distribution of Max Diastolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Diastolic Pressure:
  Statistic: 0.9921280741691589
  p-value: 1.3540282213339945e-25

The distribution of Min Diastolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Diastolic Pressure:
  Statistic: 0.984017014503479
  p-value: 4.607433333966349e-35

The distribution of Mean Diastolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Temperature:
  Statistic: 0.9718073606491089
  p-value: 8.828180325246348e-44

The distribution of Max Temperature is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Temperature:
  Statistic: 0.9212244153022766
  p-value: 0.0

The distribution of Min Temperature is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Temperature:
  Statistic: 0.9613032341003418
  p-value: 0.0

The distribution of Mean Temperature is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Lactate:
  Statistic: 0.7228389382362366
  p-value: 0.0

The distribution of Max Lactate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Lactate:
  Statistic: 0.6343636512756348
  p-value: 0.0

The distribution of Min Lactate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Lactate:
  Statistic: 0.7003828287124634
  p-value: 0.0

The distribution of Mean Lactate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max pH:
  Statistic: 0.9707695245742798
  p-value: 2.2420775429197073e-44

The distribution of Max pH is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min pH:
  Statistic: 0.9386768341064453
  p-value: 0.0

The distribution of Min pH is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean pH:
  Statistic: 0.9484682083129883
  p-value: 0.0

The distribution of Mean pH is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Glucose:
  Statistic: 0.7231720685958862
  p-value: 0.0

The distribution of Max Glucose is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Glucose:
  Statistic: 0.8644881248474121
  p-value: 0.0

The distribution of Min Glucose is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Glucose:
  Statistic: 0.8101949095726013
  p-value: 0.0

The distribution of Mean Glucose is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max WBC:
  Statistic: 0.4640159606933594
  p-value: 0.0

The distribution of Max WBC is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min WBC:
  Statistic: 0.5044788718223572
  p-value: 0.0

The distribution of Min WBC is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean WBC:
  Statistic: 0.46362489461898804
  p-value: 0.0

The distribution of Mean WBC is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max BUN:
  Statistic: 0.7306792736053467
  p-value: 0.0

The distribution of Max BUN is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min BUN:
  Statistic: 0.7287629842758179
  p-value: 0.0

The distribution of Min BUN is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean BUN:
  Statistic: 0.7304162383079529
  p-value: 0.0

The distribution of Mean BUN is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Creatinine:
  Statistic: 0.5539169311523438
  p-value: 0.0

The distribution of Max Creatinine is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Creatinine:
  Statistic: 0.5376086235046387
  p-value: 0.0

The distribution of Min Creatinine is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Creatinine:
  Statistic: 0.5448676347732544
  p-value: 0.0

The distribution of Mean Creatinine is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Hemoglobin:
  Statistic: 0.9970400333404541
  p-value: 4.320630784165934e-15

The distribution of Max Hemoglobin is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Hemoglobin:
  Statistic: 0.9944753050804138
  p-value: 1.8967766414239853e-21

The distribution of Min Hemoglobin is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Hemoglobin:
  Statistic: 0.9793708920478821
  p-value: 7.3712713313648e-39

The distribution of Mean Hemoglobin is significantly different from normal.
---------------------------------------------------------------------
/usr/local/lib/python3.10/dist-packages/scipy/stats/_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
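
The warning above means the reported p-values should be read with caution at this sample size. Below is a minimal sketch of two common workarounds, run on synthetic data rather than on our columns: apply Shapiro-Wilk to a random subsample of at most 5000 rows, or use the D'Agostino-Pearson test (`scipy.stats.normaltest`) on the full sample. The array `values`, its distribution, and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for a lab value (NOT the real dataset)
values = rng.lognormal(mean=0.5, sigma=0.6, size=20_000)

# Option 1: Shapiro-Wilk on a random subsample that stays within N <= 5000
subsample = rng.choice(values, size=5000, replace=False)
w_stat, w_p = stats.shapiro(subsample)

# Option 2: D'Agostino-Pearson omnibus test on the full sample
k2_stat, k2_p = stats.normaltest(values)

print(f"Shapiro-Wilk (n=5000): W={w_stat:.4f}, p={w_p:.3g}")
print(f"D'Agostino-Pearson (n={values.size}): K2={k2_stat:.1f}, p={k2_p:.3g}")
```

Either option still rejects normality for clearly skewed data; the point is only that the p-value is computed in a range where it is considered reliable.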

Visualization Exploration¶

Box plots

Demographics¶

In [ ]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=cont_dem_df)
plt.title("Box Plot for Continuous - Demographics")
plt.show()

Vital Signs¶

In [ ]:
plt.figure(figsize=(40, 24))
sns.boxplot(data=cont_vital_df)
plt.title("Box Plots for Continuous - Vital Signs")
plt.show()

Laboratory Results¶

In [ ]:
plt.figure(figsize=(40, 24))
sns.boxplot(data=cont_lab_df)
plt.title("Box Plots for Continuous - Lab 1")
plt.show()
In [ ]:
plt.figure(figsize=(40, 24))
sns.boxplot(data=cont_lab2_df)
plt.title("Box Plots for Continuous - Lab 2")
plt.show()

Num of Outliers

In [ ]:
contVar = df.select_dtypes(include=['float64']).columns
In [ ]:
Q1 = df[contVar.values].quantile(0.25)
Q3 = df[contVar.values].quantile(0.75)
IQR = Q3 - Q1

# Find outliers for each continuous variable
outliersCount = ((df[contVar.values] < (Q1 - 1.5 * IQR)) | (df[contVar.values] > (Q3 + 1.5 * IQR))).sum()

# Display the number of outliers for each variable
print("Number of outliers for each continuous variable:")
print(outliersCount)
Number of outliers for each continuous variable:
Max Heart Rate              205
Min Heart Rate              294
Mean Heart Rate             166
Max MAP                     724
Min MAP                     657
Mean MAP                    311
Max Systolic Pressure       273
Min Systolic Pressure       415
Mean Systolic Pressure      296
Max Diastolic Pressure      403
Min Diastolic Pressure      368
Mean Diastolic Pressure     253
Max Temperature             291
Min Temperature             390
Mean Temperature            279
Max Lactate                 992
Min Lactate                1005
Mean Lactate                919
Max pH                      258
Min pH                      477
Mean pH                     362
Max Glucose                 766
Min Glucose                 535
Mean Glucose                785
Max WBC                     456
Min WBC                     430
Mean WBC                    413
Max BUN                    1082
Min BUN                    1032
Mean BUN                   1132
Max Creatinine             1275
Min Creatinine             1315
Mean Creatinine            1361
Max Hemoglobin               92
Min Hemoglobin               76
Mean Hemoglobin             137
dtype: int64
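
Raw counts are hard to compare without the number of rows behind them. As a sketch, the same 1.5 × IQR rule can be wrapped in a helper that also reports the share of non-missing rows flagged. The function name `iqr_outlier_summary` and the toy DataFrame are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def iqr_outlier_summary(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    mask = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    counts = mask.sum()
    # Percentage is relative to the non-missing values in each column
    percent = (100 * counts / frame.notna().sum()).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy data (hypothetical): column "a" has one extreme value, column "b" has none
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})
summary = iqr_outlier_summary(toy)
print(summary)
```

On the real data the call would be `iqr_outlier_summary(df[contVar])`, assuming `df` and `contVar` are defined as in the cells above.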

Bar Charts

In [ ]:
import matplotlib.pyplot as plt

# Set a consistent figure size for all plots
fig_size = (8, 6)

for column in binary_df.columns[1:]:
    # Calculate the percentage of patients in each category for each group
    percentages = binary_df.groupby('Hospital Mortality')[column].value_counts(normalize=True) * 100

    # Bar graph drawn on an explicit Axes; calling plt.figure() before
    # DataFrame.plot() leaves an empty extra figure behind for every column
    fig, ax = plt.subplots(figsize=fig_size)
    percentages.unstack().plot(kind='bar', ax=ax)

    # Add percentage labels on top of each bar
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.2f}%', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', xytext=(0, 10), textcoords='offset points')

    plt.title(f'Percentage of Patients in Each Category of {column} by Hospital Mortality')
    plt.xlabel('Hospital Mortality')
    plt.ylabel('Percentage')
    plt.xticks(ticks=[0, 1], labels=['Survived', 'Died'], rotation=0)  # Set rotation to 0 for horizontal labels

    # legend outside the plot area
    plt.legend(title=column, bbox_to_anchor=(1.05, 1), loc='upper left')

    # Add an extra tick above the current maximum so the labels have headroom
    plt.yticks(list(plt.yticks()[0]) + [plt.yticks()[0][-1] + 10])

    plt.show()

Correlation Matrix

In [ ]:
# correlation_matrix = df[contVar].corr()
# plt.figure(figsize=(22, 20))
# sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=.5)
# plt.title("Heatmap of Correlation Matrix")
# plt.show()

Correlation Matrix without max and min variables¶

In [ ]:
# Exclude 'Hospital Mortality' and columns containing 'Min' or 'Max'
cols_to_exclude = ['Hospital Mortality'] + [col for col in cont_all_df.columns if 'Min' in col or 'Max' in col]

# correlation matrix
corr_matrix = cont_all_df.drop(cols_to_exclude, axis=1).corr()

# Round the correlation values to 2 decimal places
corr_matrix = corr_matrix.round(2)

# heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix for Continuous Variables')
plt.show()

Point plots

In [ ]:
for variable in contVar:
    # Create the figure inside the loop so every plot gets the same size;
    # a single plt.figure() before the loop only applies to the first plot
    plt.figure(figsize=(14, 8))
    sns.pointplot(x='Hospital Mortality', y=variable, data=df, errorbar='sd', dodge=True)
    plt.title(f'Point Plot of {variable}')
    plt.xlabel('Hospital Mortality')
    plt.ylabel(f'{variable}')
    plt.show()
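
The point plots show a mean with a standard-deviation bar for each mortality group. A tabular version of the same summary can be sketched with `groupby`; the toy values below are hypothetical and only the column names follow our dataset.

```python
import pandas as pd

# Hypothetical toy data using two of our column names
toy = pd.DataFrame({
    "Hospital Mortality": [0, 0, 0, 1, 1, 1],
    "Mean Lactate": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Mean and sample standard deviation of the variable within each group
group_stats = toy.groupby("Hospital Mortality")["Mean Lactate"].agg(["mean", "std"])
print(group_stats)
```

For the real data, the same pattern would be `df.groupby('Hospital Mortality')[contVar].agg(['mean', 'std'])`, assuming `df` and `contVar` as defined earlier.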

Data Normalization on Continuous Variables

Log Transformation Method

In [ ]:
log_transformed_data = cont_all_df.copy()

# Apply log(1 + x) to every continuous column; the target column is left unchanged
for column in cont_all_df.columns:
    if column != 'Hospital Mortality':
        log_transformed_data[column] = np.log1p(cont_all_df[column])

print(log_transformed_data.head())
    Hospital Mortality       Age  Max Heart Rate  Min Heart Rate  \
0                    0  4.356709        5.129899        4.330733   
1                    1  3.761200        4.718499        4.418841   
2                    1  4.290459        4.663439        4.276666   
9                    1  4.290459        4.007333        3.891820   
13                   0  4.343805        4.867534        4.521789   

    Mean Heart Rate   Max MAP   Min MAP  Mean MAP  Max Systolic Pressure  \
0          4.725490  5.560682  3.713572  4.339808               5.384495   
1          4.537961  4.890349  4.219508  4.603669               5.384495   
2          4.463936  4.709530  4.304065  4.490881               5.111988   
9          3.955672  4.605170  4.158883  4.393499               4.990433   
13         4.668946  4.488636  3.995137  4.304065               4.744932   

    Min Systolic Pressure  Mean Systolic Pressure  Max Diastolic Pressure  \
0                4.174387                4.644006                4.317488   
1                4.672829                5.077515                4.682131   
2                4.595120                4.861583                4.382027   
9                4.454347                4.746269                4.317488   
13               4.219508                4.557030                4.343805   

    Min Diastolic Pressure  Mean Diastolic Pressure  Max Temperature  \
0                 3.367296                 4.038127         3.653252   
1                 3.988984                 4.388568         3.660709   
2                 4.025352                 4.187922         3.654978   
9                 3.891820                 4.109612         3.639047   
13                3.663562                 4.148675         3.660709   

    Min Temperature  Mean Temperature  Max Lactate  Min Lactate  Mean Lactate  \
0          3.616309          3.637662     2.282382     1.131402      1.769855   
1          3.597312          3.638885     1.308333     1.064711      1.202972   
2          3.597312          3.627434     2.778819     1.098612      2.363680   
9          3.633191          3.636489     0.916291     0.875469      0.896088   
13         3.572658          3.624933     1.335001     1.335001      1.335001   

      Max pH    Min pH   Mean pH  Max Glucose  Min Glucose  Mean Glucose  \
0   2.150599  2.111425  2.123458     5.703782     4.521789      5.420092   
1   2.131797  2.127041  2.129421     5.187386     4.867534      5.041811   
2   2.122262  2.081938  2.104134     5.257495     4.488636      5.129899   
9   2.136531  2.135349  2.136531     4.762174     4.624973      4.700480   
13  2.118662  2.118662  2.118662     4.927254     4.553877      4.757891   

     Max WBC   Min WBC  Mean WBC   Max BUN   Min BUN  Mean BUN  \
0   3.234749  2.509599  2.904713  3.988984  3.737670  3.823192   
1   2.687847  2.140066  2.451005  2.890372  2.833213  2.862201   
2   2.240710  2.174752  2.208274  3.688879  3.367296  3.540959   
9   2.066863  2.066863  2.066863  2.639057  2.484907  2.564949   
13  2.954910  2.954910  2.954910  4.025352  3.761200  3.901973   

    Max Creatinine  Min Creatinine  Mean Creatinine  Max Hemoglobin  \
0         1.435085        1.223775         1.294727        2.624669   
1         0.875469        0.788457         0.832909        2.797281   
2         0.993252        0.832909         0.916291        2.660260   
9         0.641854        0.530628         0.587787        2.451005   
13        1.280934        1.029619         1.163151        2.602690   

    Min Hemoglobin  Mean Hemoglobin  
0         2.174752         2.401525  
1         2.631889         2.714695  
2         2.174752         2.418589  
9         2.451005         2.451005  
13        2.602690         2.602690  

Data distribution after the log transformation: normality improves for some variables, but the rest remain skewed, so we will use the Mann-Whitney test for most of the continuous variables.
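
As a minimal sketch of that test, the snippet below runs `scipy.stats.mannwhitneyu` on two synthetic, skewed groups standing in for survivors and non-survivors. The group sizes, distributions, and seed are assumptions made for illustration. Because the test is rank-based, a monotone transformation such as log(1 + x) leaves the U statistic unchanged, which the last lines check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic right-skewed samples (NOT the real data)
survived = rng.lognormal(mean=0.5, sigma=0.5, size=300)
died = rng.lognormal(mean=0.9, sigma=0.5, size=300)

# Two-sided Mann-Whitney U test on the raw values
u_raw, p_raw = stats.mannwhitneyu(survived, died, alternative="two-sided")

# Same test after log(1 + x): the ranks, and therefore U, do not change
u_log, p_log = stats.mannwhitneyu(np.log1p(survived), np.log1p(died),
                                  alternative="two-sided")

print(f"raw: U={u_raw:.0f}, p={p_raw:.3g}")
print(f"log: U={u_log:.0f}, p={p_log:.3g}")
```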

Visualization after log transformation method¶

In [ ]:
for column in log_transformed_data.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(log_transformed_data, column)

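Shapiro-Wilk Test After Log Transformation¶

We applied the Shapiro-Wilk test to the log-transformed data. The test still rejects normality for every variable, although visual inspection of the histograms and Q-Q plots suggests otherwise for some of them. With more than 5000 observations, the test picks up even small departures from normality, and SciPy warns that the p-values may not be accurate at this sample size.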

In [ ]:
from scipy import stats

# Perform Shapiro-Wilk test on each column in log_transformed_data,
# skipping the first column ('Hospital Mortality')
for column in log_transformed_data.columns[1:]:
    stat, p = stats.shapiro(log_transformed_data[column])
    print(f"Shapiro-Wilk test for {column}:")
    print(f"  Statistic: {stat}")
    print(f"  p-value: {p}")
    print("")

    # Reject normality at the 5% significance level
    if p < 0.05:
        print(f"The distribution of {column} is significantly different from normal.")
    else:
        print(f"The distribution of {column} is not significantly different from normal.")
    print("---------------------------------------------------------------------")
Shapiro-Wilk test for Age:
  Statistic: 0.8753691911697388
  p-value: 0.0

The distribution of Age is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Heart Rate:
  Statistic: 0.9981642365455627
  p-value: 4.8528583929119407e-11

The distribution of Max Heart Rate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Heart Rate:
  Statistic: 0.8807404041290283
  p-value: 0.0

The distribution of Min Heart Rate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Heart Rate:
  Statistic: 0.9987273216247559
  p-value: 2.0667322075951233e-08

The distribution of Mean Heart Rate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max MAP:
  Statistic: 0.8827296495437622
  p-value: 0.0

The distribution of Max MAP is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min MAP:
  Statistic: 0.8855717778205872
  p-value: 0.0

The distribution of Min MAP is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean MAP:
  Statistic: 0.9907329678535461
  p-value: 1.228228599741419e-27

The distribution of Mean MAP is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Systolic Pressure:
  Statistic: 0.9919484257698059
  p-value: 7.146964553654495e-26

The distribution of Max Systolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Systolic Pressure:
  Statistic: 0.8003720641136169
  p-value: 0.0

The distribution of Min Systolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Systolic Pressure:
  Statistic: 0.989392876625061
  p-value: 2.166625597062738e-29

The distribution of Mean Systolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Diastolic Pressure:
  Statistic: 0.980971097946167
  p-value: 1.232273283038572e-37

The distribution of Max Diastolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Diastolic Pressure:
  Statistic: 0.9163969159126282
  p-value: 0.0

The distribution of Min Diastolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Diastolic Pressure:
  Statistic: 0.9966349005699158
  p-value: 2.676607172664644e-16

The distribution of Mean Diastolic Pressure is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Temperature:
  Statistic: 0.9670045375823975
  p-value: 0.0

The distribution of Max Temperature is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Temperature:
  Statistic: 0.898748517036438
  p-value: 0.0

The distribution of Min Temperature is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Temperature:
  Statistic: 0.9532508850097656
  p-value: 0.0

The distribution of Mean Temperature is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Lactate:
  Statistic: 0.9504865407943726
  p-value: 0.0

The distribution of Max Lactate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Lactate:
  Statistic: 0.9007787108421326
  p-value: 0.0

The distribution of Min Lactate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Lactate:
  Statistic: 0.9346543550491333
  p-value: 0.0

The distribution of Mean Lactate is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max pH:
  Statistic: 0.968603253364563
  p-value: 1.401298464324817e-45

The distribution of Max pH is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min pH:
  Statistic: 0.9329463243484497
  p-value: 0.0

The distribution of Min pH is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean pH:
  Statistic: 0.9446377158164978
  p-value: 0.0

The distribution of Mean pH is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Glucose:
  Statistic: 0.9644086956977844
  p-value: 0.0

The distribution of Max Glucose is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Glucose:
  Statistic: 0.9811251759529114
  p-value: 1.6319274165958327e-37

The distribution of Min Glucose is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Glucose:
  Statistic: 0.961662769317627
  p-value: 0.0

The distribution of Mean Glucose is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max WBC:
  Statistic: 0.9587361216545105
  p-value: 0.0

The distribution of Max WBC is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min WBC:
  Statistic: 0.9652097225189209
  p-value: 0.0

The distribution of Min WBC is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean WBC:
  Statistic: 0.956272304058075
  p-value: 0.0

The distribution of Mean WBC is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max BUN:
  Statistic: 0.9715502858161926
  p-value: 6.305843089461677e-44

The distribution of Max BUN is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min BUN:
  Statistic: 0.9770194888114929
  p-value: 1.5867182771242769e-40

The distribution of Min BUN is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean BUN:
  Statistic: 0.9725632071495056
  p-value: 2.4242463432819335e-43

The distribution of Mean BUN is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Creatinine:
  Statistic: 0.8170948028564453
  p-value: 0.0

The distribution of Max Creatinine is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Creatinine:
  Statistic: 0.800369918346405
  p-value: 0.0

The distribution of Min Creatinine is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Creatinine:
  Statistic: 0.8064248561859131
  p-value: 0.0

The distribution of Mean Creatinine is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Max Hemoglobin:
  Statistic: 0.9962196350097656
  p-value: 1.9383733575538924e-17

The distribution of Max Hemoglobin is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Min Hemoglobin:
  Statistic: 0.9910911917686462
  p-value: 3.893604805505054e-27

The distribution of Min Hemoglobin is significantly different from normal.
---------------------------------------------------------------------
Shapiro-Wilk test for Mean Hemoglobin:
  Statistic: 0.9963561296463013
  p-value: 4.490599763789102e-17

The distribution of Mean Hemoglobin is significantly different from normal.
---------------------------------------------------------------------
/usr/local/lib/python3.10/dist-packages/scipy/stats/_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
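
Since the p-values are effectively zero before and after the transformation, the W statistic and the sample skewness are easier to compare; for example, W for Max Lactate rises from about 0.72 on the raw data to about 0.95 after the log transformation. The sketch below builds such a before/after table on a synthetic column. The helper `shape_report`, the column name `x`, and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic right-skewed column (NOT the real data); n < 5000 avoids the warning
raw = pd.DataFrame({"x": rng.lognormal(mean=0.0, sigma=0.8, size=4000)})
logged = np.log1p(raw)

def shape_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the Shapiro-Wilk W statistic and sample skewness per column."""
    rows = {}
    for col in frame.columns:
        w, _ = stats.shapiro(frame[col])
        rows[col] = {"W": w, "skew": stats.skew(frame[col])}
    return pd.DataFrame(rows).T

# One row per variable, raw and log-transformed side by side
comparison = shape_report(raw).join(shape_report(logged),
                                    lsuffix="_raw", rsuffix="_log")
print(comparison.round(3))
```

A W closer to 1 and a skewness closer to 0 indicate a distribution closer to normal, even when the test still rejects.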

L2 Normalization

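L2 normalization does not make the data normal for statistical tests: dividing each column by its Euclidean norm only rescales the values, so the shape of each distribution is unchanged.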
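
A small check of this point on synthetic data (the array `x` and the seed are assumptions, not our dataset): after division by the L2 norm, the skewness and the Shapiro-Wilk W statistic are the same as before, up to floating-point error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic right-skewed sample (NOT the real data)
x = rng.lognormal(mean=0.0, sigma=0.7, size=2000)

# L2 normalization of a single column: divide by its Euclidean norm
x_l2 = x / np.linalg.norm(x)

skew_raw, skew_l2 = stats.skew(x), stats.skew(x_l2)
w_raw, _ = stats.shapiro(x)
w_l2, _ = stats.shapiro(x_l2)

print(f"skewness: raw={skew_raw:.4f}, L2={skew_l2:.4f}")
print(f"Shapiro W: raw={w_raw:.4f}, L2={w_l2:.4f}")
```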

In [ ]:
X = cont_all_df.to_numpy()
norms = np.linalg.norm(X, axis=0)
normalized_data = X / norms
normalized_df = pd.DataFrame(normalized_data, columns=cont_all_df.columns)
In [ ]:
normalized_df.head()
Out[ ]:
   Hospital Mortality       Age  Max Heart Rate  Min Heart Rate  \
0            0.000000  0.010717        0.013675        0.009115
1            0.021527  0.005846        0.009036        0.009965
2            0.021527  0.010021        0.008547        0.008629
3            0.021527  0.010021        0.004396        0.005833
4            0.000000  0.010578        0.010501        0.011059

   Mean Heart Rate   Max MAP   Min MAP  Mean MAP  Max Systolic Pressure  \
0         0.011144  0.020411  0.006102  0.008584               0.012558
1         0.009222  0.010402  0.010220  0.011210               0.012558
2         0.008557  0.008669  0.011135  0.010003               0.009549
3         0.005107  0.007802  0.009610  0.009064               0.008449
4         0.010526  0.006935  0.008136  0.008279               0.006597

   Min Systolic Pressure  Mean Systolic Pressure  Max Diastolic Pressure  \
0               0.006451                0.007799                0.007731
1               0.010684                0.012072                0.011178
2               0.009878                0.009713                0.008253
3               0.008567                0.008647                0.007731
4               0.006753                0.007143                0.007940

   Min Diastolic Pressure  Mean Diastolic Pressure  Max Temperature  \
0                0.005592                 0.008140         0.008910
1                0.010584                 0.011617         0.008979
2                0.010984                 0.009479         0.008926
3                0.009586                 0.008754         0.008781
4                0.007589                 0.009108         0.008979

   Min Temperature  Mean Temperature  Max Lactate  Min Lactate  Mean Lactate  \
0         0.008987          0.008956     0.018049     0.008220      0.013812
1         0.008813          0.008967     0.005538     0.007437      0.006608
2         0.008813          0.008863     0.030971     0.007828      0.027312
3         0.009144          0.008945     0.003077     0.005480      0.004112
4         0.008592          0.008840     0.005743     0.010959      0.007941

     Max pH    Min pH   Mean pH  Max Glucose  Min Glucose  Mean Glucose  \
0  0.009139  0.008911  0.008942     0.012649     0.007068      0.013003
1  0.008946  0.009071  0.009003     0.007530     0.010019      0.008889
2  0.008850  0.008617  0.008748     0.008080     0.006835      0.009713
3  0.008995  0.009157  0.009076     0.004907     0.007844      0.006302
4  0.008814  0.008985  0.008893     0.005796     0.007301      0.006678

    Max WBC   Min WBC  Mean WBC   Max BUN   Min BUN  Mean BUN  \
0  0.011270  0.007406  0.009490  0.014316  0.013204  0.013219
1  0.006328  0.004916  0.005828  0.004592  0.005153  0.004874
2  0.003880  0.005112  0.004453  0.010535  0.009017  0.009896
3  0.003187  0.004522  0.003794  0.003512  0.003542  0.003545
4  0.008406  0.011929  0.010006  0.014857  0.013526  0.014326

   Max Creatinine  Min Creatinine  Mean Creatinine  Max Hemoglobin  \
0        0.014234        0.013160         0.013121        0.009110
1        0.006227        0.006580         0.006437        0.010960
2        0.007562        0.007128         0.007427        0.009466
3        0.004003        0.003838         0.003961        0.007544
4        0.011565        0.009870         0.010893        0.008896

   Min Hemoglobin  Mean Hemoglobin
0        0.006999         0.008081
1        0.011576         0.011349
2        0.006999         0.008234
3        0.009512         0.008532
4        0.011217         0.010061

Visualization after L2 Normalization¶

In [ ]:
for column in normalized_df.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(normalized_df, column)

BoxCox Method

In [ ]:
from scipy.stats import boxcox

# Apply Box-Cox transformation to all numeric columns
box_cox_data = cont_all_df.copy()  # Work on a copy of the original DataFrame

# Define a function that applies the Box-Cox transformation to one column
def apply_boxcox(x):
    # Skip non-numeric columns
    if not pd.api.types.is_numeric_dtype(x):
        return x

    # Check if the column contains any zero or negative values
    if (x <= 0).any():
        # Shift so that all values are strictly positive (Box-Cox requires x > 0)
        x = x + np.abs(x.min()) + 1e-6

    # Apply Box-Cox transformation; the fitted lambda is discarded
    transformed_data, _ = boxcox(x)
    return pd.Series(transformed_data, index=x.index, name=x.name)

# Apply the function column by column. apply() passes each column as a Series,
# whereas applymap() would pass single values, which boxcox cannot handle.
box_cox_df = box_cox_data.apply(apply_boxcox)

# Now box_cox_df contains Box-Cox transformed values for all numeric columns
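
To make the effect of the transformation concrete, here is a minimal sketch on synthetic data. The log-normal sample is a hypothetical stand-in for a right-skewed lab value and is not taken from our dataset; the point is only to show the shift-if-needed step and the drop in skewness.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (hypothetical, not from the dataset)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

# Box-Cox needs strictly positive input, so shift only when necessary
shift = 0.0
if sample.min() <= 0:
    shift = abs(sample.min()) + 1e-6

# boxcox returns the transformed values and the fitted lambda
transformed, fitted_lambda = stats.boxcox(sample + shift)

skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)
print(f"lambda = {fitted_lambda:.3f}")
print(f"skewness before = {skew_before:.3f}, after = {skew_after:.3f}")
```

A skewness closer to zero after the transformation suggests a more symmetric distribution, but it does not by itself establish normality.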

Visualization After BoxCox Method¶

In [ ]:
for column in box_cox_df.columns:
    if column != 'Hospital Mortality':
        plot_histogram_qqplot(box_cox_df, column)
[37 figures: histogram and Q-Q plot for each continuous variable after the Box-Cox transformation]

Min-Max Method

In [ ]:
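from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Apply Min-Max scaling to all columns (each column is mapped to the range [0, 1])
scaled_data = scaler.fit_transform(cont_all_df)

# Convert the scaled data array back to a DataFrame, keeping the original row index
scaled_df = pd.DataFrame(scaled_data, columns=cont_all_df.columns, index=cont_all_df.index)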
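
As a side note, Min-Max scaling with the default range computes (x - min) / (max - min) per column. The sketch below reproduces that formula with plain pandas on a hypothetical toy DataFrame (the column names `x` and `y` are made up) and shows that, because the rescaling is linear, the skewness of each column stays the same.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'x' has one large value, so it is right-skewed
toy = pd.DataFrame({
    'x': [1.0, 2.0, 2.5, 3.0, 20.0],
    'y': [10.0, 12.0, 11.0, 15.0, 13.0],
})

# Column-wise Min-Max formula: (value - column min) / (column max - column min)
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_scaled)
print("skewness before:", toy.skew().round(4).tolist())
print("skewness after: ", toy_scaled.skew().round(4).tolist())
```

This is why we would not expect the histograms and Q-Q plots after Min-Max scaling to look more normal than the originals: only the axis range changes.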

Visualization After Min-Max Method¶

In [ ]:
# Iterate through columns and create plots
for column in scaled_df.columns:
    if column != 'Hospital Mortality':  # Skip the target column
        plot_histogram_qqplot(scaled_df, column)
[37 figures: histogram and Q-Q plot for each continuous variable after Min-Max scaling]

Statistical Tests¶

In [ ]:
# Create an empty dictionary to store results assuming not normal distribution
result_dict = {'Variable': [], 'Data_Type': [], 'Type_of_Test': [], 'P-value': []}

# Create an empty dictionary to store results assuming normal distribution
# result_dict_n = {'Variable': [], 'Data_Type': [], 'Type_of_Test': [], 'P-value': []}

Continuous Variables

Independent t-Test

On the dataset without missing values, the majority of the t-test results are statistically significant. However, since the data are not normally distributed, we will use a nonparametric test.


Assumptions of independent t-test

  • Independence of the observations. Each subject should belong to only one group, and there should be no relationship between the observations in each group.
  • No significant outliers in the two groups.
  • Normality. The data for each group should be approximately normally distributed. With large samples, the Central Limit Theorem makes the test less sensitive to this assumption.
  • Homogeneity of variances. The variance of the outcome variable should be equal in each group. Recall that Welch's t-test does not make this assumption.
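
The homogeneity assumption matters most when group sizes and variances both differ. The following is a minimal sketch on synthetic data (the two groups, their sizes, and their spreads are hypothetical, not taken from our dataset) that runs Levene's test and then compares Student's t-test with Welch's t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical groups: a large low-variance group and a small high-variance group
group_a = rng.normal(loc=100.0, scale=5.0, size=1000)
group_b = rng.normal(loc=101.0, scale=25.0, size=200)

# Levene's test: H0 is that the two group variances are equal
levene_stat, levene_p = stats.levene(group_a, group_b)

# Student's t-test pools the variances; Welch's t-test does not
student_t, student_p = stats.ttest_ind(group_a, group_b, equal_var=True)
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Levene p-value:  {levene_p:.3g}")
print(f"Student t = {student_t:.3f}, p = {student_p:.3g}")
print(f"Welch   t = {welch_t:.3f}, p = {welch_p:.3g}")
```

In this setup the pooled variance understates the standard error, so Student's t statistic is larger in magnitude than Welch's. This is the reason to check Levene's test before choosing between the two.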

Levene Test for homogeneity¶

  • H0: the variances of groups are equal
  • H1: the variances of groups are NOT equal

Every continuous variable returns a p-value below 0.05, so we reject H0: the variances of the groups are NOT equal.

In [ ]:
# Conduct Levene's test on cont_all_df to check homogeneity of variances
# between survivors (0) and non-survivors (1).

# Extract the numeric columns
numeric_cols = cont_all_df.select_dtypes(include=['int64', 'float64']).columns

# Boolean masks for the two outcome groups
is_survivor = cont_all_df['Hospital Mortality'] == 0
is_non_survivor = cont_all_df['Hospital Mortality'] == 1

# Initialize an empty list to store results
levene_results = []

# Note: numeric_cols still contains 'Hospital Mortality' itself. Each group is
# constant for that column, so its statistic is NaN and it is labelled 'No'.
for col in numeric_cols:
    # Perform Levene's test
    levene_statistic, levene_pvalue = stats.levene(cont_all_df.loc[is_survivor, col], cont_all_df.loc[is_non_survivor, col])

    # Store the results; variances are treated as homogeneous only if p > 0.05
    levene_results.append({
        'Variable': col,
        'Levene Statistic': levene_statistic,
        'Levene p-value': levene_pvalue,
        'Homogeneity': 'Yes' if levene_pvalue > 0.05 else 'No',
    })

# Create a DataFrame to display the results
levene_df = pd.DataFrame(levene_results)

# Print the DataFrame
print(levene_df.to_string())
                   Variable  Levene Statistic  Levene p-value Homogeneity
0        Hospital Mortality               NaN             NaN          No
1                       Age          9.766476    1.781301e-03          No
2            Max Heart Rate        228.954740    2.867679e-51          No
3            Min Heart Rate        381.172645    1.214381e-83          No
4           Mean Heart Rate        242.538863    3.550201e-54          No
5                   Max MAP         51.127977    9.135754e-13          No
6                   Min MAP        280.885248    2.308827e-62          No
7                  Mean MAP        128.331355    1.324610e-29          No
8     Max Systolic Pressure        151.111596    1.568772e-34          No
9     Min Systolic Pressure        309.132049    2.244147e-68          No
10   Mean Systolic Pressure        225.880693    1.306078e-50          No
11   Max Diastolic Pressure         67.658065    2.136216e-16          No
12   Min Diastolic Pressure        167.034840    5.756374e-38          No
13  Mean Diastolic Pressure         77.281635    1.674975e-18          No
14          Max Temperature        586.364446   1.255342e-126          No
15          Min Temperature        329.538161    1.039833e-72          No
16         Mean Temperature        644.300648   1.228729e-138          No
17              Max Lactate        850.401647   6.381367e-181          No
18              Min Lactate        820.758774   7.002810e-175          No
19             Mean Lactate       1041.349674   1.656900e-219          No
20                   Max pH        364.313198    4.453588e-80          No
21                   Min pH        940.224449   3.843229e-199          No
22                  Mean pH       1061.326595   1.638447e-223          No
23              Max Glucose        241.999709    4.629889e-54          No
24              Min Glucose        361.134361    2.095750e-79          No
25             Mean Glucose        393.565497    2.934593e-86          No
26                  Max WBC        146.193383    1.811252e-33          No
27                  Min WBC        224.203857    2.986887e-50          No
28                 Mean WBC        190.961710    4.069839e-43          No
29                  Max BUN        373.597778    4.843634e-82          No
30                  Min BUN        453.030987    8.886399e-99          No
31                 Mean BUN        417.703660    2.390378e-91          No
32           Max Creatinine        149.299426    3.863182e-34          No
33           Min Creatinine        167.901130    3.744817e-38          No
34          Mean Creatinine        158.743041    3.534169e-36          No
35           Max Hemoglobin         59.074669    1.631323e-14          No
36           Min Hemoglobin         15.581030    7.947696e-05          No
37          Mean Hemoglobin         39.784147    2.932115e-10          No
/usr/local/lib/python3.10/dist-packages/scipy/stats/_morestats.py:3189: RuntimeWarning: invalid value encountered in scalar divide
  W = numer / denom

Welch's t-test¶

In [ ]:
tranformed_log = log_transformed_data.copy()
welch_data = tranformed_log[['Hospital Mortality', 'Max Heart Rate', 'Mean Heart Rate', 'Mean MAP', 'Mean Systolic Pressure', 'Mean Diastolic Pressure', 'Mean BUN', 'Max Hemoglobin', 'Mean Hemoglobin']]
In [ ]:
tranformed_log['Hospital Mortality'].value_counts()
Out[ ]:
0    10331
1     2158
Name: Hospital Mortality, dtype: int64
In [ ]:
# Welch's t-test (equal_var=False) on the log-transformed variables in welch_data,
# comparing survivors (Hospital Mortality == 0) with non-survivors (Hospital Mortality == 1)

for column in welch_data.columns[1:]:
    t_statistic, p_value = stats.ttest_ind(welch_data[column][welch_data['Hospital Mortality'] == 0], welch_data[column][welch_data['Hospital Mortality'] == 1], equal_var=False)
    print(column)
    # Print or use the results as needed
    print(f"Welch's T-test for {column}:")
    print(f"  T-statistic: {t_statistic}")
    print(f"  P-value: {p_value}")
    print("")

    if p_value < 0.05:
        print(f"The difference in {column} between survivors and non-survivors is statistically significant.")
        print("---------------------------------------------------------------------")
    else:
        print(f"There is no significant difference in {column} between survivors and non-survivors.")
        print("---------------------------------------------------------------------")

    result_dict['Variable'].append(column)
    result_dict['Data_Type'].append('Continuous')
    result_dict['Type_of_Test'].append('Welch\'s T-test')
    result_dict['P-value'].append(p_value)
Max Heart Rate
Welch's T-test for Max Heart Rate:
  T-statistic: -14.425352999038484
  P-value: 1.5137029803818268e-45

The difference in Max Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Heart Rate
Welch's T-test for Mean Heart Rate:
  T-statistic: -9.820534110448351
  P-value: 2.1132933705136722e-22

The difference in Mean Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean MAP
Welch's T-test for Mean MAP:
  T-statistic: 9.963466996165206
  P-value: 5.409360708044105e-23

The difference in Mean MAP between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Systolic Pressure
Welch's T-test for Mean Systolic Pressure:
  T-statistic: 11.317210031572678
  P-value: 4.935026168815372e-29

The difference in Mean Systolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Diastolic Pressure
Welch's T-test for Mean Diastolic Pressure:
  T-statistic: 8.723910851015027
  P-value: 4.528457004895649e-18

The difference in Mean Diastolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean BUN
Welch's T-test for Mean BUN:
  T-statistic: -27.18572952625232
  P-value: 4.778132353060818e-145

The difference in Mean BUN between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Hemoglobin
Welch's T-test for Max Hemoglobin:
  T-statistic: 9.42725880042387
  P-value: 8.444630921832603e-21

The difference in Max Hemoglobin between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Hemoglobin
Welch's T-test for Mean Hemoglobin:
  T-statistic: 0.8239296708080953
  P-value: 0.41004708014899816

There is no significant difference in Mean Hemoglobin between survivors and non-survivors.
---------------------------------------------------------------------

Mann-Whitney U test¶

In [ ]:
columns_to_exclude = ['Mean Heart Rate', 'Max Heart Rate','Mean MAP', 'Mean Systolic Pressure', 'Mean Diastolic Pressure', 'Mean BUN', 'Max Hemoglobin', 'Mean Hemoglobin']

# Dropping the specified columns
mann_cont_data = cont_all_df.drop(columns=columns_to_exclude)
In [ ]:
# Separate data into two groups based on target
survivors = mann_cont_data[mann_cont_data['Hospital Mortality'] == 0]
non_survivors = mann_cont_data[mann_cont_data['Hospital Mortality'] == 1]

# Perform Mann-Whitney U test for each numerical column
for column in mann_cont_data.columns[1:]:
    stat, p = stats.mannwhitneyu(survivors[column], non_survivors[column])
    print(column)
    # Print or use the results as needed
    print(f"Mann-Whitney U test for {column}:")
    print(f"  Statistic: {stat}")
    print(f"  P-value: {p}")
    print("")

    if p < 0.05:
        print(f"The difference in {column} between survivors and non-survivors is statistically significant.")
        print("---------------------------------------------------------------------")
    else:
        print(f"There is no significant difference in {column} between survivors and non-survivors.")
        print("---------------------------------------------------------------------")

    # Append results to the dictionary
    result_dict['Variable'].append(column)
    result_dict['Data_Type'].append('Continuous')
    result_dict['Type_of_Test'].append('Mann-Whitney U')
    result_dict['P-value'].append(p)
Age
Mann-Whitney U test for Age:
  Statistic: 8525675.5
  P-value: 2.1525320528626538e-66

The difference in Age between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min Heart Rate
Mann-Whitney U test for Min Heart Rate:
  Statistic: 11076044.0
  P-value: 0.6405817553911917

There is no significant difference in Min Heart Rate between survivors and non-survivors.
---------------------------------------------------------------------
Max MAP
Mann-Whitney U test for Max MAP:
  Statistic: 11096635.5
  P-value: 0.7401547602044709

There is no significant difference in Max MAP between survivors and non-survivors.
---------------------------------------------------------------------
Min MAP
Mann-Whitney U test for Min MAP:
  Statistic: 14278470.0
  P-value: 5.902742296218549e-94

The difference in Min MAP between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Systolic Pressure
Mann-Whitney U test for Max Systolic Pressure:
  Statistic: 11397950.5
  P-value: 0.09964242690167496

There is no significant difference in Max Systolic Pressure between survivors and non-survivors.
---------------------------------------------------------------------
Min Systolic Pressure
Mann-Whitney U test for Min Systolic Pressure:
  Statistic: 14238452.0
  P-value: 1.3500531808846305e-91

The difference in Min Systolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Diastolic Pressure
Mann-Whitney U test for Max Diastolic Pressure:
  Statistic: 10874113.0
  P-value: 0.07300896056630023

There is no significant difference in Max Diastolic Pressure between survivors and non-survivors.
---------------------------------------------------------------------
Min Diastolic Pressure
Mann-Whitney U test for Min Diastolic Pressure:
  Statistic: 14159804.0
  P-value: 3.8903901136352047e-87

The difference in Min Diastolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Temperature
Mann-Whitney U test for Max Temperature:
  Statistic: 11815671.5
  P-value: 1.1390806265114146e-05

The difference in Max Temperature between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min Temperature
Mann-Whitney U test for Min Temperature:
  Statistic: 12427119.5
  P-value: 4.344242710802271e-17

The difference in Min Temperature between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Temperature
Mann-Whitney U test for Mean Temperature:
  Statistic: 12364791.5
  P-value: 1.3125499468833547e-15

The difference in Mean Temperature between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Lactate
Mann-Whitney U test for Max Lactate:
  Statistic: 8190966.0
  P-value: 6.177760645508349e-84

The difference in Max Lactate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min Lactate
Mann-Whitney U test for Min Lactate:
  Statistic: 7130572.5
  P-value: 1.4044905844472804e-153

The difference in Min Lactate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Lactate
Mann-Whitney U test for Mean Lactate:
  Statistic: 7683017.5
  P-value: 1.694887046403263e-114

The difference in Mean Lactate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max pH
Mann-Whitney U test for Max pH:
  Statistic: 12850534.0
  P-value: 4.311982639570609e-29

The difference in Max pH between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min pH
Mann-Whitney U test for Min pH:
  Statistic: 13348631.0
  P-value: 2.1446956365998007e-47

The difference in Min pH between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean pH
Mann-Whitney U test for Mean pH:
  Statistic: 13346745.5
  P-value: 2.205225301421819e-47

The difference in Mean pH between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Glucose
Mann-Whitney U test for Max Glucose:
  Statistic: 9378364.5
  P-value: 3.5942111872237883e-31

The difference in Max Glucose between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min Glucose
Mann-Whitney U test for Min Glucose:
  Statistic: 9174274.5
  P-value: 2.2841251393337134e-38

The difference in Min Glucose between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Glucose
Mann-Whitney U test for Mean Glucose:
  Statistic: 8716584.0
  P-value: 2.594076017013178e-57

The difference in Mean Glucose between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max WBC
Mann-Whitney U test for Max WBC:
  Statistic: 10015821.0
  P-value: 1.1117900361048767e-13

The difference in Max WBC between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min WBC
Mann-Whitney U test for Min WBC:
  Statistic: 9939952.5
  P-value: 2.281025809249482e-15

The difference in Min WBC between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean WBC
Mann-Whitney U test for Mean WBC:
  Statistic: 9949532.5
  P-value: 3.781638033820371e-15

The difference in Mean WBC between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max BUN
Mann-Whitney U test for Max BUN:
  Statistic: 6910567.0
  P-value: 2.0237352811575977e-170

The difference in Max BUN between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min BUN
Mann-Whitney U test for Min BUN:
  Statistic: 7025303.0
  P-value: 1.7651631567022414e-161

The difference in Min BUN between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Creatinine
Mann-Whitney U test for Max Creatinine:
  Statistic: 7640747.0
  P-value: 6.435699951912311e-118

The difference in Max Creatinine between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min Creatinine
Mann-Whitney U test for Min Creatinine:
  Statistic: 8025284.5
  P-value: 3.9910953959453173e-94

The difference in Min Creatinine between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Creatinine
Mann-Whitney U test for Mean Creatinine:
  Statistic: 7753167.5
  P-value: 4.6757947488289196e-110

The difference in Mean Creatinine between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Min Hemoglobin
Mann-Whitney U test for Min Hemoglobin:
  Statistic: 10611817.0
  P-value: 0.0004403852646089298

The difference in Min Hemoglobin between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------

The following Mann-Whitney U test is conducted on the variables that became normally distributed after the log transformation. We created this subset to compare the results with those of Welch's t-test.
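
One way to make this comparison easier to read is to put both p-values side by side. The sketch below is only an illustration: `compare_tests` is a hypothetical helper, the toy data are synthetic, and `np.log1p` is an assumption standing in for whatever log transformation produced `log_transformed_data`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_tests(df, target, columns, alpha=0.05):
    """Hypothetical helper: Welch's t-test on log1p values vs Mann-Whitney U on raw values."""
    group0 = df[df[target] == 0]
    group1 = df[df[target] == 1]
    rows = []
    for col in columns:
        _, welch_p = stats.ttest_ind(np.log1p(group0[col]), np.log1p(group1[col]), equal_var=False)
        _, mw_p = stats.mannwhitneyu(group0[col], group1[col], alternative='two-sided')
        rows.append({
            'Variable': col,
            'Welch p (log)': welch_p,
            'Mann-Whitney p (raw)': mw_p,
            # True when both tests reach the same decision at the given alpha
            'Agree': (welch_p < alpha) == (mw_p < alpha),
        })
    return pd.DataFrame(rows)

# Synthetic example: 'shifted' differs between groups, 'same' does not by construction
rng = np.random.default_rng(1)
n0, n1 = 500, 100
toy = pd.DataFrame({
    'outcome': np.r_[np.zeros(n0, dtype=int), np.ones(n1, dtype=int)],
    'shifted': np.r_[rng.lognormal(3.0, 0.4, n0), rng.lognormal(3.4, 0.4, n1)],
    'same': rng.lognormal(2.0, 0.3, n0 + n1),
})
comparison = compare_tests(toy, 'outcome', ['shifted', 'same'])
print(comparison)
```

Applied to our data, the call would take `cont_all_df`, `'Hospital Mortality'`, and the eight variable names used for Welch's t-test, provided the same log transformation is used.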

In [ ]:
subset_welch = cont_all_df.copy()
MW_comparison = subset_welch[['Hospital Mortality', 'Max Heart Rate', 'Mean Heart Rate', 'Mean MAP', 'Mean Systolic Pressure', 'Mean Diastolic Pressure', 'Mean BUN', 'Max Hemoglobin', 'Mean Hemoglobin']]

# Separate data into two groups based on target
survivors = MW_comparison[MW_comparison['Hospital Mortality'] == 0]
non_survivors = MW_comparison[MW_comparison['Hospital Mortality'] == 1]

# Perform Mann-Whitney U test for each numerical column
for column in MW_comparison.columns[1:]:
    stat, p = stats.mannwhitneyu(survivors[column], non_survivors[column])
    print(column)
    # Print or use the results as needed
    print(f"Mann-Whitney U test for {column}:")
    print(f"  Statistic: {stat}")
    print(f"  P-value: {p}")
    print("")

    if p < 0.05:
        print(f"The difference in {column} between survivors and non-survivors is statistically significant.")
        print("---------------------------------------------------------------------")
    else:
        print(f"There is no significant difference in {column} between survivors and non-survivors.")
        print("---------------------------------------------------------------------")
Max Heart Rate
Mann-Whitney U test for Max Heart Rate:
  Statistic: 8640638.5
  P-value: 7.524132792815273e-61

The difference in Max Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Heart Rate
Mann-Whitney U test for Mean Heart Rate:
  Statistic: 9409815.0
  P-value: 3.947274866302104e-30

The difference in Mean Heart Rate between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean MAP
Mann-Whitney U test for Mean MAP:
  Statistic: 12733640.5
  P-value: 2.1228661641088843e-25

The difference in Mean MAP between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Systolic Pressure
Mann-Whitney U test for Mean Systolic Pressure:
  Statistic: 13073048.5
  P-value: 1.224837336030013e-36

The difference in Mean Systolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Diastolic Pressure
Mann-Whitney U test for Mean Diastolic Pressure:
  Statistic: 12469046.0
  P-value: 4.0347388451056675e-18

The difference in Mean Diastolic Pressure between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean BUN
Mann-Whitney U test for Mean BUN:
  Statistic: 6923118.0
  P-value: 2.922259721412044e-169

The difference in Mean BUN between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Max Hemoglobin
Mann-Whitney U test for Max Hemoglobin:
  Statistic: 12656776.0
  P-value: 3.714891731259355e-23

The difference in Max Hemoglobin between survivors and non-survivors is statistically significant.
---------------------------------------------------------------------
Mean Hemoglobin
Mann-Whitney U test for Mean Hemoglobin:
  Statistic: 11249336.0
  P-value: 0.5023332041707467

There is no significant difference in Mean Hemoglobin between survivors and non-survivors.
---------------------------------------------------------------------

Spearman Correlation¶

In [ ]:
# Create an empty dictionary to store the correlation results
corr_results_cont = {'Variable_1': [], 'Variable_2': [], 'Test': [],'Correlation': [], 'P_Value': [], 'Group': []}
cols_to_exclude_s = [col for col in cont_all_df.columns if 'Min_x' in col or 'Max_x' in col]
spearman_matrix = cont_all_df.drop(cols_to_exclude_s, axis=1)

# Separate data into two groups based on target (once, outside the loops)
survivors = spearman_matrix[spearman_matrix['Hospital Mortality'] == 0]
non_survivors = spearman_matrix[spearman_matrix['Hospital Mortality'] == 1]

# Iterate over pairs of continuous variables, excluding 'Hospital Mortality'.
# Pairs that involve neither the first nor the last variable are visited in both orders.
for col1 in spearman_matrix.columns[1:-1]:
    for col2 in spearman_matrix.columns[2:]:
        if col1 != col2:

            # Calculate Spearman's correlation coefficient and p-value for each group
            corr_survivors, p_survivors = stats.spearmanr(survivors[col1], survivors[col2])
            corr_non_survivors, p_non_survivors = stats.spearmanr(non_survivors[col1], non_survivors[col2])

            # Store the results
            corr_results_cont['Variable_1'].append(col1)
            corr_results_cont['Variable_2'].append(col2)
            corr_results_cont['Correlation'].append(corr_survivors)
            corr_results_cont['P_Value'].append(p_survivors)
            corr_results_cont['Group'].append('Survivors')
            corr_results_cont['Test'].append('Spearman')

            corr_results_cont['Variable_1'].append(col1)
            corr_results_cont['Variable_2'].append(col2)
            corr_results_cont['Correlation'].append(corr_non_survivors)
            corr_results_cont['P_Value'].append(p_non_survivors)
            corr_results_cont['Group'].append('Non-Survivors')
            corr_results_cont['Test'].append('Spearman')

# Create a DataFrame from the results
corr_df_cont = pd.DataFrame(corr_results_cont)
In [ ]:
# filter by -1 to -.5 and .5 to 1

# Filter the correlation values
filtered_corr = corr_df_cont[(corr_df_cont['Correlation'] <= -0.5) | (corr_df_cont['Correlation'] >= 0.5)]
print(filtered_corr.shape)

# Sort the filtered results by Correlation in descending order
sorted_corr = filtered_corr.sort_values(by='Correlation', ascending=False)

sorted_corr['Correlation'] = sorted_corr['Correlation'].round(4)

# Extract the measurement name (second word, e.g. 'MAP' from 'Mean MAP')
sorted_corr['Second_Word_Variable_1'] = sorted_corr['Variable_1'].str.split().str[1]
sorted_corr['Second_Word_Variable_2'] = sorted_corr['Variable_2'].str.split().str[1]

# Drop pairs built from the same measurement (e.g. 'Min MAP' vs 'Max MAP')
sorted_corr = sorted_corr[sorted_corr['Second_Word_Variable_1'] != sorted_corr['Second_Word_Variable_2']]
print(sorted_corr.shape)

# Drop both helper columns
sorted_corr = sorted_corr.drop(['Second_Word_Variable_2','Second_Word_Variable_1'], axis=1)

sorted_corr.reset_index(drop=True, inplace=True)

# Keep every second row; this assumes each pair appears twice, as (A, B) and (B, A), in adjacent rows after sorting
sorted_corr = sorted_corr[sorted_corr.index % 2 != 0]

print(sorted_corr.shape)

print(sorted_corr)
(204, 6)
(90, 8)
(45, 6)
                 Variable_1               Variable_2      Test  Correlation  \
1                  Mean MAP  Mean Diastolic Pressure  Spearman       0.8582   
3                  Mean MAP  Mean Diastolic Pressure  Spearman       0.8569   
5    Min Diastolic Pressure                  Min MAP  Spearman       0.8189   
7    Min Diastolic Pressure                  Min MAP  Spearman       0.8148   
9     Min Systolic Pressure                  Min MAP  Spearman       0.7987   
11                 Mean MAP   Mean Systolic Pressure  Spearman       0.7556   
13                  Max MAP   Max Diastolic Pressure  Spearman       0.7452   
15   Max Diastolic Pressure                  Max MAP  Spearman       0.7362   
17          Mean Creatinine                  Max BUN  Spearman       0.7351   
19          Mean Creatinine                 Mean BUN  Spearman       0.7328   
21           Min Creatinine                  Min BUN  Spearman       0.7320   
23                 Mean BUN           Min Creatinine  Spearman       0.7314   
25           Max Creatinine                  Max BUN  Spearman       0.7298   
27           Min Creatinine                  Max BUN  Spearman       0.7186   
29                 Mean BUN          Mean Creatinine  Spearman       0.7161   
31           Max Creatinine                 Mean BUN  Spearman       0.7145   
33                  Min MAP    Min Systolic Pressure  Spearman       0.7143   
35          Mean Creatinine                  Min BUN  Spearman       0.7138   
37                  Max BUN           Max Creatinine  Spearman       0.7094   
39           Min Creatinine                  Min BUN  Spearman       0.7048   
41                  Max BUN          Mean Creatinine  Spearman       0.7046   
43                 Mean MAP   Mean Systolic Pressure  Spearman       0.7038   
45                  Min BUN          Mean Creatinine  Spearman       0.7038   
47           Max Creatinine                 Mean BUN  Spearman       0.7032   
49                 Mean BUN           Min Creatinine  Spearman       0.6957   
51                  Min BUN           Max Creatinine  Spearman       0.6833   
53    Max Systolic Pressure                  Max MAP  Spearman       0.6739   
55           Max Creatinine                  Min BUN  Spearman       0.6734   
57                  Max BUN           Min Creatinine  Spearman       0.6680   
59                  Max MAP    Max Systolic Pressure  Spearman       0.6467   
61   Min Diastolic Pressure    Min Systolic Pressure  Spearman       0.6334   
63   Min Diastolic Pressure                 Mean MAP  Spearman       0.6235   
65    Max Systolic Pressure                 Mean MAP  Spearman       0.6105   
67  Mean Diastolic Pressure                  Min MAP  Spearman       0.6095   
69   Max Diastolic Pressure                 Mean MAP  Spearman       0.6088   
71   Min Diastolic Pressure                 Mean MAP  Spearman       0.6041   
73   Max Diastolic Pressure                 Mean MAP  Spearman       0.5847   
75                  Max MAP  Mean Diastolic Pressure  Spearman       0.5596   
77   Max Diastolic Pressure    Max Systolic Pressure  Spearman       0.5512   
79   Mean Systolic Pressure                  Min MAP  Spearman       0.5493   
81   Min Diastolic Pressure    Min Systolic Pressure  Spearman       0.5186   
83                 Mean MAP    Min Systolic Pressure  Spearman       0.5106   
85  Mean Diastolic Pressure                  Min MAP  Spearman       0.5080   
87    Min Systolic Pressure                 Mean MAP  Spearman       0.5078   
89              Max Lactate                   Min pH  Spearman      -0.5302   

          P_Value          Group  
1    0.000000e+00      Survivors  
3    0.000000e+00  Non-Survivors  
5    0.000000e+00  Non-Survivors  
7    0.000000e+00      Survivors  
9    0.000000e+00  Non-Survivors  
11   0.000000e+00  Non-Survivors  
13   0.000000e+00  Non-Survivors  
15   0.000000e+00      Survivors  
17   0.000000e+00  Non-Survivors  
19   0.000000e+00  Non-Survivors  
21   0.000000e+00  Non-Survivors  
23   0.000000e+00  Non-Survivors  
25   0.000000e+00  Non-Survivors  
27   0.000000e+00  Non-Survivors  
29   0.000000e+00      Survivors  
31   0.000000e+00  Non-Survivors  
33   0.000000e+00      Survivors  
35   0.000000e+00  Non-Survivors  
37   0.000000e+00      Survivors  
39   0.000000e+00      Survivors  
41   0.000000e+00      Survivors  
43   0.000000e+00      Survivors  
45   0.000000e+00      Survivors  
47   0.000000e+00      Survivors  
49   0.000000e+00      Survivors  
51  8.287704e-297  Non-Survivors  
53  9.774152e-286  Non-Survivors  
55   0.000000e+00      Survivors  
57   0.000000e+00      Survivors  
59   0.000000e+00      Survivors  
61  2.256757e-242  Non-Survivors  
63   0.000000e+00      Survivors  
65  1.279377e-220  Non-Survivors  
67   0.000000e+00      Survivors  
69   0.000000e+00      Survivors  
71  7.549115e-215  Non-Survivors  
73  4.401237e-198  Non-Survivors  
75  4.030514e-178  Non-Survivors  
77  8.346475e-172  Non-Survivors  
79  2.152481e-170  Non-Survivors  
81   0.000000e+00      Survivors  
83   0.000000e+00      Survivors  
85  6.528197e-142  Non-Survivors  
87  8.496490e-142  Non-Survivors  
89  9.810846e-157  Non-Survivors  
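Keeping every second row works only when each pair appears exactly twice and the two copies sort next to each other. A minimal sketch of an alternative, assuming a small synthetic frame in place of `cont_all_df` (the column names and the helper `pairwise_spearman` are hypothetical), generates each unordered pair once with `itertools.combinations`, so no de-duplication step is needed:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def pairwise_spearman(frame, target, groups):
    """Spearman correlation for each unordered column pair, per target group."""
    features = [c for c in frame.columns if c != target]
    rows = []
    for label, value in groups.items():
        # Split once per group instead of once per column pair
        subset = frame[frame[target] == value]
        for col1, col2 in combinations(features, 2):
            rho, p = stats.spearmanr(subset[col1], subset[col2])
            rows.append({'Variable_1': col1, 'Variable_2': col2,
                         'Test': 'Spearman', 'Correlation': rho,
                         'P_Value': p, 'Group': label})
    return pd.DataFrame(rows)


# Synthetic stand-in data (not the notebook's dataset)
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
demo = pd.DataFrame({
    'Hospital Mortality': rng.integers(0, 2, size=n),
    'Mean A': base + rng.normal(scale=0.3, size=n),
    'Mean B': base + rng.normal(scale=0.3, size=n),
    'Mean C': rng.normal(size=n),
})

result = pairwise_spearman(demo, 'Hospital Mortality',
                           {'Survivors': 0, 'Non-Survivors': 1})

# Same strength filter as in the cell above
strong = result[result['Correlation'].abs() >= 0.5]
print(strong)
```

Because each pair is produced once, the result does not depend on the sort order or on how ties in the correlation values happen to be arranged.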
In [ ]:
cont_spear_df = cont_all_df[["Hospital Mortality","Mean Diastolic Pressure", "Mean Systolic Pressure", "Mean BUN", "Mean Creatinine", "Mean MAP","Max Lactate","Min pH"]]

cont_spear_df.head(5)
Out[ ]:
    Hospital Mortality  Mean Diastolic Pressure  Mean Systolic Pressure  Mean BUN  Mean Creatinine   Mean MAP  Max Lactate  Min pH
0                    0                55.720000              102.960000     44.75             2.65  75.692812          8.8    7.26
1                    1                79.525000              159.375000     16.50             1.30  98.850000          2.7    7.39
2                    1                64.885714              128.228571     33.50             1.50  88.200000         15.1    7.02
9                    1                59.923077              114.153846     12.00             0.80  79.923077          1.5    7.46
13                   0                62.350000               94.300000     48.50             2.20  73.000005          2.8    7.32
In [ ]:
import seaborn as sns

sns.pairplot(data=cont_spear_df, hue="Hospital Mortality")
plt.show()
[Figure: pair plot of the selected continuous variables, colored by Hospital Mortality]

Density plot of continuous variables¶

Click here

Ordinal Variables

Mann-Whitney U test¶

In [ ]:
for column in ordinal_df.columns:
    if column != 'Hospital Mortality':
        stat, p = stats.mannwhitneyu(ordinal_df[column][ordinal_df['Hospital Mortality'] == 0],
                                     ordinal_df[column][ordinal_df['Hospital Mortality'] == 1])
        print(f"Mann-Whitney U test for {column}:")
        print(f"  Statistic: {stat}")
        print(f"  p-value: {p}")
        print("---------------------------------------------------------------------")

        # Append results to the summary dictionary
        result_dict['Variable'].append(column)
        result_dict['Data_Type'].append('Ordinal')
        result_dict['Type_of_Test'].append('Mann-Whitney U')
        result_dict['P-value'].append(p)

        # result_dict_n['Variable'].append(column)
        # result_dict_n['Data_Type'].append('Ordinal')
        # result_dict_n['Type_of_Test'].append('Mann-Whitney U')
        # result_dict_n['P-value'].append(p)
Mann-Whitney U test for SAPS II:
  Statistic: 5016710.5
  p-value: 0.0
---------------------------------------------------------------------
Mann-Whitney U test for SOFA:
  Statistic: 6833386.5
  p-value: 3.564756196880308e-178
---------------------------------------------------------------------
Mann-Whitney U test for OASIS:
  Statistic: 5437509.0
  p-value: 6.727799650542416e-308
---------------------------------------------------------------------
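
With samples this large the p-values are all extremely small, so the U statistic itself is often easier to read when rescaled. As a minimal sketch on synthetic scores (not the notebook's data; `score_alive` and `score_dead` are hypothetical names), U divided by the number of cross-group pairs estimates the probability that a randomly chosen value from the first group exceeds one from the second, with ties counted as one half:

```python
import numpy as np
from scipy import stats

# Synthetic ordinal scores standing in for a severity score in each group
rng = np.random.default_rng(1)
score_alive = rng.integers(10, 50, size=300)
score_dead = rng.integers(25, 70, size=80)

n1, n2 = len(score_alive), len(score_dead)

# U for the first sample, from its rank sum in the pooled data (average ranks for ties)
ranks = stats.rankdata(np.concatenate([score_alive, score_dead]))
u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2

# Same quantity by brute force: pairs where alive > dead, ties counted as 0.5
diff = score_alive[:, None] - score_dead[None, :]
u1_pairs = (diff > 0).sum() + 0.5 * (diff == 0).sum()

# Rescale U to a probability between 0 and 1
prob_alive_higher = u1 / (n1 * n2)

_, p_value = stats.mannwhitneyu(score_alive, score_dead, alternative='two-sided')
print(f"U = {u1}, P(alive > dead) = {prob_alive_higher:.3f}, p = {p_value:.3g}")
```

A value near 0.5 would indicate heavily overlapping groups even when the p-value is tiny; values near 0 or 1 indicate strong separation.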

Spearman Correlation¶

As a supplementary analysis, Spearman correlation was computed between the ordinal variables, separately for survivors and non-survivors.

In [ ]:
# Create an empty dictionary to store the correlation results
corr_results = {'Variable_1': [], 'Variable_2': [], 'Test': [], 'Correlation': [], 'P_value': [], 'Group': []}

# Iterate over the ordinal variables, excluding 'Hospital Mortality'
for col1 in ordinal_df.columns[1:-1]:
    for col2 in ordinal_df.columns[2:]:
        if col1 != col2:
            # Separate data into two groups based on target
            survivors = ordinal_df[ordinal_df['Hospital Mortality'] == 0]
            non_survivors = ordinal_df[ordinal_df['Hospital Mortality'] == 1]

            # Calculate Spearman's correlation coefficient and p-value for each group
            corr_survivors, p_value_survivors = stats.spearmanr(survivors[col1], survivors[col2])
            corr_non_survivors, p_value_non_survivors = stats.spearmanr(non_survivors[col1], non_survivors[col2])

            # Store the results
            corr_results['Variable_1'].extend([col1, col1])
            corr_results['Variable_2'].extend([col2, col2])
            corr_results['Correlation'].extend([corr_survivors, corr_non_survivors])
            corr_results['P_value'].extend([p_value_survivors, p_value_non_survivors])
            corr_results['Group'].extend(['Survivors', 'Non-Survivors'])
            corr_results['Test'].extend(['Spearman', 'Spearman'])

# Create a DataFrame from the results
corr_df = pd.DataFrame(corr_results)

# Print the DataFrame
print(corr_df)
  Variable_1 Variable_2      Test  Correlation        P_value          Group
0    SAPS II       SOFA  Spearman     0.596778   0.000000e+00      Survivors
1    SAPS II       SOFA  Spearman     0.675700  7.755076e-288  Non-Survivors
2    SAPS II      OASIS  Spearman     0.598171   0.000000e+00      Survivors
3    SAPS II      OASIS  Spearman     0.674522  1.814579e-286  Non-Survivors
4       SOFA      OASIS  Spearman     0.360364   0.000000e+00      Survivors
5       SOFA      OASIS  Spearman     0.453597  5.383949e-110  Non-Survivors

Density plot of ordinal variables¶

In [ ]:
for column in ordinal_df.columns:
    if column != 'Hospital Mortality':
        # Create a figure and axes
        fig, ax = plt.subplots(figsize=(10, 6))

        # density plot for each group
        sns.kdeplot(data=df[df['Hospital Mortality'] == 0][column], ax = ax, label='Alive', fill=True, color = 'g')
        sns.kdeplot(data=df[df['Hospital Mortality'] == 1][column], ax = ax, label='Dead', fill=True, color = 'r')
        sns.kdeplot(data=df[column], label='Overall Classes',ax = ax, fill=True,color='b')


        plt.title(f'Density Plot for {column}')
        plt.xlabel(column)
        plt.ylabel('Density')

        plt.legend()


        plt.show()
[Figures: density plots for SAPS II, SOFA, and OASIS, each showing the Alive, Dead, and overall distributions]

Scatter Plot¶

In [ ]:
# Create a scatter plot matrix for ordinal_df
sns.pairplot(ordinal_df, hue='Hospital Mortality', palette='Set1')

# Set the title of the plot
plt.suptitle('Scatter Plot Matrix for Ordinal Variables')

# Display the plot
plt.show()
[Figure: scatter plot matrix for the ordinal variables, colored by Hospital Mortality]

Binary Variables

Chi-Square test¶

In [ ]:
for column in binary_df.columns[1:]:
    contingency_table = pd.crosstab(df['Hospital Mortality'], df[column])
    chi2, p, _, _ = chi2_contingency(contingency_table)
    print()
    print(contingency_table)
    print()
    print(f"Chi-square test between Hospital Mortality and {column}:")
    print(f"Chi2 value: {chi2}")
    print(f"P-value: {p}")
    print("--------------------------------------------------------")

    # Append results to the dictionary
    result_dict['Variable'].append(column)
    result_dict['Data_Type'].append('Categorical')
    result_dict['Type_of_Test'].append('Chi-Square')
    result_dict['P-value'].append(p)

    # result_dict_n['Variable'].append(column)
    # result_dict_n['Data_Type'].append('Categorical')
    # result_dict_n['Type_of_Test'].append('Chi-Square')
    # result_dict_n['P-value'].append(p)
Gender                 F     M
Hospital Mortality            
0                   3938  6393
1                    953  1205

Chi-square test between Hospital Mortality and Gender:
Chi2 value: 27.10758978619394
P-value: 1.9244091683381194e-07
--------------------------------------------------------

Uncomplicated Hypertension     0     1
Hospital Mortality                    
0                           5635  4696
1                           1366   792

Chi-square test between Hospital Mortality and Uncomplicated Hypertension:
Chi2 value: 55.189196979179414
P-value: 1.0946874261933157e-13
--------------------------------------------------------

Complicated Hypertension     0    1
Hospital Mortality                 
0                         9486  845
1                         1941  217

Chi-square test between Hospital Mortality and Complicated Hypertension:
Chi2 value: 7.838345176804054
P-value: 0.005114940941223975
--------------------------------------------------------

Uncomplicated Diabetes     0     1
Hospital Mortality                
0                       8273  2058
1                       1740   418

Chi-square test between Hospital Mortality and Uncomplicated Diabetes:
Chi2 value: 0.30699355812333073
P-value: 0.5795309446027188
--------------------------------------------------------

Complicated Diabetes     0    1
Hospital Mortality             
0                     9771  560
1                     2050  108

Chi-square test between Hospital Mortality and Complicated Diabetes:
Chi2 value: 0.5306520803096673
P-value: 0.4663328403623749
--------------------------------------------------------

Malignancy             0     1
Hospital Mortality            
0                   9265  1066
1                   1756   402

Chi-square test between Hospital Mortality and Malignancy:
Chi2 value: 118.04115471197295
P-value: 1.698274709058758e-27
--------------------------------------------------------

Hematologic Disease     0     1
Hospital Mortality             
0                    8820  1511
1                    1610   548

Chi-square test between Hospital Mortality and Hematologic Disease:
Chi2 value: 149.55075555628915
P-value: 2.1734758882909007e-34
--------------------------------------------------------

Metastasis             0    1
Hospital Mortality           
0                   9921  410
1                   1955  203

Chi-square test between Hospital Mortality and Metastasis:
Chi2 value: 111.94872842345087
P-value: 3.666710397229884e-26
--------------------------------------------------------

Peripheral Vascular Disease     0    1
Hospital Mortality                    
0                            9465  866
1                            1983  175

Chi-square test between Hospital Mortality and Peripheral Vascular Disease:
Chi2 value: 0.14043292179104155
P-value: 0.7078510082500947
--------------------------------------------------------

Hypothyroidism         0    1
Hospital Mortality           
0                   9442  889
1                   1986  172

Chi-square test between Hospital Mortality and Hypothyroidism:
Chi2 value: 0.8455723053468485
P-value: 0.3578079363406852
--------------------------------------------------------

Chronic Heart Failure     0     1
Hospital Mortality               
0                      7965  2366
1                      1539   619

Chi-square test between Hospital Mortality and Chronic Heart Failure:
Chi2 value: 32.494673323998754
P-value: 1.1951967205829576e-08
--------------------------------------------------------

Stroke                 0    1
Hospital Mortality           
0                   9810  521
1                   1985  173

Chi-square test between Hospital Mortality and Stroke:
Chi2 value: 29.512849417030985
P-value: 5.5547217143098614e-08
--------------------------------------------------------

Liver Disease          0    1
Hospital Mortality           
0                   9430  901
1                   1721  437

Chi-square test between Hospital Mortality and Liver Disease:
Chi2 value: 246.83972334726226
P-value: 1.2688950646354187e-55
--------------------------------------------------------

Sepsis                 0     1
Hospital Mortality            
0                   8872  1459
1                   1356   802

Chi-square test between Hospital Mortality and Sepsis:
Chi2 value: 637.6686458256187
P-value: 1.0739326128684081e-140
--------------------------------------------------------

Any Organ Failure      0     1
Hospital Mortality            
0                   5039  5292
1                    423  1735

Chi-square test between Hospital Mortality and Any Organ Failure:
Chi2 value: 616.2527393356861
P-value: 4.884011258785986e-136
--------------------------------------------------------

Severe Respiratory Failure     0    1
Hospital Mortality                   
0                           9699  632
1                           1794  364

Chi-square test between Hospital Mortality and Severe Respiratory Failure:
Chi2 value: 279.62518427423595
P-value: 9.063066574953642e-63
--------------------------------------------------------

Severe Coagulation Failure      0   1
Hospital Mortality                   
0                           10298  33
1                            2110  48

Chi-square test between Hospital Mortality and Severe Coagulation Failure:
Chi2 value: 97.58692892932098
P-value: 5.1543055964643073e-23
--------------------------------------------------------

Severe Liver Failure      0   1
Hospital Mortality             
0                     10260  71
1                      2065  93

Chi-square test between Hospital Mortality and Severe Liver Failure:
Chi2 value: 177.95721250164928
P-value: 1.353497596467662e-40
--------------------------------------------------------

Severe Cardiovascular Failure     0     1
Hospital Mortality                       
0                              9267  1064
1                              1383   775

Chi-square test between Hospital Mortality and Severe Cardiovascular Failure:
Chi2 value: 930.6517936766536
P-value: 2.1311389141756498e-204
--------------------------------------------------------

Severe Central Nervous System Failure     0    1
Hospital Mortality                              
0                                      9804  527
1                                      1949  209

Chi-square test between Hospital Mortality and Severe Central Nervous System Failure:
Chi2 value: 66.80535560379377
P-value: 2.9968284695977144e-16
--------------------------------------------------------

Severe Renal Failure     0    1
Hospital Mortality             
0                     9964  367
1                     1839  319

Chi-square test between Hospital Mortality and Severe Renal Failure:
Chi2 value: 431.49833268868963
P-value: 7.66968217562407e-96
--------------------------------------------------------

Respiratory Dysfunction     0     1
Hospital Mortality                 
0                        7728  2603
1                        1083  1075

Chi-square test between Hospital Mortality and Respiratory Dysfunction:
Chi2 value: 519.5454689895382
P-value: 5.314118235626247e-115
--------------------------------------------------------

Cardiovascular Dysfunction     0     1
Hospital Mortality                    
0                           9104  1227
1                           1381   777

Chi-square test between Hospital Mortality and Cardiovascular Dysfunction:
Chi2 value: 769.6863056545292
P-value: 2.103599616429413e-169
--------------------------------------------------------

Renal Dysfunction      0     1
Hospital Mortality            
0                   7861  2470
1                   1102  1056

Chi-square test between Hospital Mortality and Renal Dysfunction:
Chi2 value: 550.5302091270887
P-value: 9.652686387061328e-122
--------------------------------------------------------

Hematologic Dysfunction     0     1
Hospital Mortality                 
0                        9286  1045
1                        1709   449

Chi-square test between Hospital Mortality and Hematologic Dysfunction:
Chi2 value: 192.72721363359676
P-value: 8.073483031253256e-44
--------------------------------------------------------

Metabolic Dysfunction     0     1
Hospital Mortality               
0                      9289  1042
1                      1674   484

Chi-square test between Hospital Mortality and Metabolic Dysfunction:
Chi2 value: 252.36956043917155
P-value: 7.904304859920673e-57
--------------------------------------------------------

Neurologic Dysfunction     0    1
Hospital Mortality               
0                       9391  940
1                       1854  304

Chi-square test between Hospital Mortality and Neurologic Dysfunction:
Chi2 value: 48.97268661992234
P-value: 2.595517465197432e-12
--------------------------------------------------------
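
The chi-square p-value indicates whether an association is present but not how strong it is. A minimal sketch, using the Sepsis counts printed above, adds the odds ratio as one possible effect size. Note that SciPy's `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, which the Sepsis statistic above is consistent with:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Hospital Mortality 0 / 1; columns: Sepsis 0 / 1 (counts from the output above)
sepsis_table = np.array([[8872, 1459],
                         [1356, 802]])

# Default call (Yates' correction for 2x2 tables) and the uncorrected variant
chi2_yates, p_yates, dof, expected = chi2_contingency(sepsis_table)
chi2_plain, p_plain, _, _ = chi2_contingency(sepsis_table, correction=False)

# Odds of death with sepsis divided by odds of death without sepsis
odds_with = sepsis_table[1, 1] / sepsis_table[0, 1]
odds_without = sepsis_table[1, 0] / sepsis_table[0, 0]
odds_ratio = odds_with / odds_without

print(f"chi2 (Yates) = {chi2_yates:.2f}, chi2 (uncorrected) = {chi2_plain:.2f}")
print(f"odds ratio = {odds_ratio:.2f}")
```

The same calculation applies to any of the 2x2 tables in this section; an odds ratio close to 1 corresponds to the variables whose p-values are not significant.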

Box Plot of continuous variables¶

In [ ]:
for column in cont_all_df.columns:
    if column != 'Hospital Mortality':
        sns.boxplot(x='Hospital Mortality', y=column, data=cont_all_df, hue='Hospital Mortality', legend=False)
        plt.title(f'Box Plot for {column} by Hospital Mortality')
        plt.show()
[Figures: box plots of each continuous variable by Hospital Mortality, one plot per variable]

Summary Table

Summary Table for Statistical tests conducted¶

In [ ]:
vital_signs = ['Max Heart Rate', 'Min Heart Rate', 'Mean Heart Rate', 'Max MAP', 'Min MAP', 'Mean MAP', 'Max Systolic Pressure', 'Min Systolic Pressure', 'Mean Systolic Pressure', 'Max Diastolic Pressure', 'Min Diastolic Pressure', 'Mean Diastolic Pressure', 'Max Temperature', 'Min Temperature', 'Mean Temperature']
demographic = ['Age', 'Gender']
diagnosis = ['Sepsis', 'Any Organ Failure', 'Severe Respiratory Failure', 'Severe Coagulation Failure', 'Severe Liver Failure', 'Severe Cardiovascular Failure', 'Severe Central Nervous System Failure', 'Severe Renal Failure', 'Respiratory Dysfunction', 'Cardiovascular Dysfunction', 'Renal Dysfunction', 'Hematologic Dysfunction', 'Metabolic Dysfunction', 'Neurologic Dysfunction']
severity = ['SAPS II', 'SOFA', 'OASIS']
lab = ['Max Lactate', 'Min Lactate', 'Mean Lactate', 'Max pH', 'Min pH', 'Mean pH', 'Max Glucose', 'Min Glucose', 'Mean Glucose', 'Max WBC', 'Min WBC', 'Mean WBC', 'Max BUN', 'Min BUN', 'Mean BUN', 'Max Creatinine', 'Min Creatinine', 'Mean Creatinine', 'Max Hemoglobin', 'Min Hemoglobin', 'Mean Hemoglobin']
history = ['Uncomplicated Hypertension', 'Complicated Hypertension', 'Uncomplicated Diabetes', 'Complicated Diabetes', 'Malignancy', 'Hematologic Disease', 'Metastasis', 'Peripheral Vascular Disease', 'Hypothyroidism', 'Chronic Heart Failure', 'Stroke', 'Liver Disease']
summary_of_tests = pd.DataFrame(result_dict)

def categorize_variable(variable):
    if variable in vital_signs:
        return 'Vital signs'
    elif variable in demographic:
        return 'Demographic'
    elif variable in diagnosis:
        return 'Diagnosis'
    elif variable in severity:
        return 'Severity'
    elif variable in lab:
        return 'Laboratory results'
    elif variable in history:
        return 'Medical history'
    else:
        return 'Other'

# Apply the function
summary_of_tests['Category'] = summary_of_tests['Variable'].apply(categorize_variable)
summary_of_tests.insert(1, 'Category', summary_of_tests.pop('Category'))
summary_of_tests.sort_values(by='Category', inplace=True)
# summary_of_tests['P-value'] = round(summary_of_tests['P-value'], 5)
print(summary_of_tests)
# print(type(summary_of_tests['P-value']))
                                 Variable            Category    Data_Type  \
40                                 Gender         Demographic  Categorical   
8                                     Age         Demographic   Continuous   
66                 Neurologic Dysfunction           Diagnosis  Categorical   
65                  Metabolic Dysfunction           Diagnosis  Categorical   
53                                 Sepsis           Diagnosis  Categorical   
55             Severe Respiratory Failure           Diagnosis  Categorical   
56             Severe Coagulation Failure           Diagnosis  Categorical   
57                   Severe Liver Failure           Diagnosis  Categorical   
54                      Any Organ Failure           Diagnosis  Categorical   
59  Severe Central Nervous System Failure           Diagnosis  Categorical   
60                   Severe Renal Failure           Diagnosis  Categorical   
61                Respiratory Dysfunction           Diagnosis  Categorical   
62             Cardiovascular Dysfunction           Diagnosis  Categorical   
63                      Renal Dysfunction           Diagnosis  Categorical   
58          Severe Cardiovascular Failure           Diagnosis  Categorical   
64                Hematologic Dysfunction           Diagnosis  Categorical   
28                                Max WBC  Laboratory results   Continuous   
29                                Min WBC  Laboratory results   Continuous   
30                               Mean WBC  Laboratory results   Continuous   
31                                Max BUN  Laboratory results   Continuous   
34                         Min Creatinine  Laboratory results   Continuous   
35                        Mean Creatinine  Laboratory results   Continuous   
36                         Min Hemoglobin  Laboratory results   Continuous   
27                           Mean Glucose  Laboratory results   Continuous   
32                                Min BUN  Laboratory results   Continuous   
26                            Min Glucose  Laboratory results   Continuous   
33                         Max Creatinine  Laboratory results   Continuous   
24                                Mean pH  Laboratory results   Continuous   
25                            Max Glucose  Laboratory results   Continuous   
6                          Max Hemoglobin  Laboratory results   Continuous   
5                                Mean BUN  Laboratory results   Continuous   
19                            Max Lactate  Laboratory results   Continuous   
7                         Mean Hemoglobin  Laboratory results   Continuous   
21                           Mean Lactate  Laboratory results   Continuous   
22                                 Max pH  Laboratory results   Continuous   
23                                 Min pH  Laboratory results   Continuous   
20                            Min Lactate  Laboratory results   Continuous   
52                          Liver Disease     Medical history  Categorical   
51                                 Stroke     Medical history  Categorical   
50                  Chronic Heart Failure     Medical history  Categorical   
49                         Hypothyroidism     Medical history  Categorical   
48            Peripheral Vascular Disease     Medical history  Categorical   
47                             Metastasis     Medical history  Categorical   
45                             Malignancy     Medical history  Categorical   
46                    Hematologic Disease     Medical history  Categorical   
43                 Uncomplicated Diabetes     Medical history  Categorical   
42               Complicated Hypertension     Medical history  Categorical   
41             Uncomplicated Hypertension     Medical history  Categorical   
44                   Complicated Diabetes     Medical history  Categorical   
39                                  OASIS            Severity      Ordinal   
38                                   SOFA            Severity      Ordinal   
37                                SAPS II            Severity      Ordinal   
1                         Mean Heart Rate         Vital signs   Continuous   
2                                Mean MAP         Vital signs   Continuous   
3                  Mean Systolic Pressure         Vital signs   Continuous   
4                 Mean Diastolic Pressure         Vital signs   Continuous   
18                       Mean Temperature         Vital signs   Continuous   
17                        Min Temperature         Vital signs   Continuous   
11                                Min MAP         Vital signs   Continuous   
9                          Min Heart Rate         Vital signs   Continuous   
10                                Max MAP         Vital signs   Continuous   
12                  Max Systolic Pressure         Vital signs   Continuous   
13                  Min Systolic Pressure         Vital signs   Continuous   
15                 Min Diastolic Pressure         Vital signs   Continuous   
14                 Max Diastolic Pressure         Vital signs   Continuous   
16                        Max Temperature         Vital signs   Continuous   
0                          Max Heart Rate         Vital signs   Continuous   

      Type_of_Test        P-value  
40      Chi-Square   1.924409e-07  
8   Mann-Whitney U   2.152532e-66  
66      Chi-Square   2.595517e-12  
65      Chi-Square   7.904305e-57  
53      Chi-Square  1.073933e-140  
55      Chi-Square   9.063067e-63  
56      Chi-Square   5.154306e-23  
57      Chi-Square   1.353498e-40  
54      Chi-Square  4.884011e-136  
59      Chi-Square   2.996828e-16  
60      Chi-Square   7.669682e-96  
61      Chi-Square  5.314118e-115  
62      Chi-Square  2.103600e-169  
63      Chi-Square  9.652686e-122  
58      Chi-Square  2.131139e-204  
64      Chi-Square   8.073483e-44  
28  Mann-Whitney U   1.111790e-13  
29  Mann-Whitney U   2.281026e-15  
30  Mann-Whitney U   3.781638e-15  
31  Mann-Whitney U  2.023735e-170  
34  Mann-Whitney U   3.991095e-94  
35  Mann-Whitney U  4.675795e-110  
36  Mann-Whitney U   4.403853e-04  
27  Mann-Whitney U   2.594076e-57  
32  Mann-Whitney U  1.765163e-161  
26  Mann-Whitney U   2.284125e-38  
33  Mann-Whitney U  6.435700e-118  
24  Mann-Whitney U   2.205225e-47  
25  Mann-Whitney U   3.594211e-31  
6   Welch's T-test   8.444631e-21  
5   Welch's T-test  4.778132e-145  
19  Mann-Whitney U   6.177761e-84  
7   Welch's T-test   4.100471e-01  
21  Mann-Whitney U  1.694887e-114  
22  Mann-Whitney U   4.311983e-29  
23  Mann-Whitney U   2.144696e-47  
20  Mann-Whitney U  1.404491e-153  
52      Chi-Square   1.268895e-55  
51      Chi-Square   5.554722e-08  
50      Chi-Square   1.195197e-08  
49      Chi-Square   3.578079e-01  
48      Chi-Square   7.078510e-01  
47      Chi-Square   3.666710e-26  
45      Chi-Square   1.698275e-27  
46      Chi-Square   2.173476e-34  
43      Chi-Square   5.795309e-01  
42      Chi-Square   5.114941e-03  
41      Chi-Square   1.094687e-13  
44      Chi-Square   4.663328e-01  
39  Mann-Whitney U  6.727800e-308  
38  Mann-Whitney U  3.564756e-178  
37  Mann-Whitney U   0.000000e+00  
1   Welch's T-test   2.113293e-22  
2   Welch's T-test   5.409361e-23  
3   Welch's T-test   4.935026e-29  
4   Welch's T-test   4.528457e-18  
18  Mann-Whitney U   1.312550e-15  
17  Mann-Whitney U   4.344243e-17  
11  Mann-Whitney U   5.902742e-94  
9   Mann-Whitney U   6.405818e-01  
10  Mann-Whitney U   7.401548e-01  
12  Mann-Whitney U   9.964243e-02  
13  Mann-Whitney U   1.350053e-91  
15  Mann-Whitney U   3.890390e-87  
14  Mann-Whitney U   7.300896e-02  
16  Mann-Whitney U   1.139081e-05  
0   Welch's T-test   1.513703e-45  
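The `if`/`elif` chain in `categorize_variable` can also be written as a lookup table, which keeps the grouping in one place. A minimal sketch with shortened lists (the names `groups` and `category_of` are hypothetical, and the lists are truncated here for brevity):

```python
import pandas as pd

# Shortened stand-ins for the full lists defined in the notebook
groups = {
    'Vital signs': ['Max Heart Rate', 'Mean MAP'],
    'Demographic': ['Age', 'Gender'],
    'Severity': ['SAPS II', 'SOFA', 'OASIS'],
}

# Invert the mapping to variable -> category
category_of = {var: cat for cat, names in groups.items() for var in names}

demo = pd.DataFrame({'Variable': ['Age', 'SOFA', 'Mean MAP', 'Unknown Var']})

# Variables missing from the lookup fall back to 'Other', as in the original function
demo['Category'] = demo['Variable'].map(category_of).fillna('Other')
print(demo)
```

Adding a new group then only requires a new dictionary entry rather than another `elif` branch.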
In [ ]:
summary_of_tests['Category'].value_counts()
Out[ ]:
Laboratory results    21
Vital signs           15
Diagnosis             14
Medical history       12
Severity               3
Demographic            2
Name: Category, dtype: int64
In [ ]:
# Export the DataFrame to an Excel file
summary_of_tests.to_excel('summary_of_tests.xlsx', index=False)